
HMER: Image to LaTeX Converter
A deep learning-based system for converting images of mathematical expressions into their corresponding LaTeX code representation.
Project Overview
The Image to LaTeX (img2latex) project implements a deep learning-based system for converting images of mathematical expressions into LaTeX code. This technology addresses a significant challenge in digital document processing: transforming visual representations of mathematical formulas into their corresponding markup representation, which is essential for editing, searching, and accessibility.
Mathematical expressions are ubiquitous in scientific, engineering, and academic literature, but transferring them between different formats can be cumbersome. Traditional Optical Character Recognition (OCR) systems often struggle with the complex two-dimensional structure of mathematical formulas. The img2latex project provides an end-to-end solution to automatically recognize and transcribe mathematical expressions from images, significantly reducing the manual effort required for digitizing printed mathematical content.
Key Features
- Dual Model Architectures: Implementation of both CNN-LSTM and ResNet-LSTM architectures for comparative analysis
- Multiple Decoding Strategies: Support for greedy search, beam search, and sampling with temperature/top-k/top-p
- Comprehensive Evaluation: Performance assessment using token-level accuracy, BLEU score, and Levenshtein similarity
- Visualization Tools: Advanced visualization of training metrics, dataset statistics, and model predictions
- Cross-Platform Support: Acceleration on Apple Silicon (MPS), NVIDIA GPUs (CUDA), and CPU fallback
- Command-line Interface: Easy-to-use CLI for training, evaluation, and prediction
Technical Approach
Dataset Analysis
The project leverages the IM2LaTeX-100k dataset, consisting of over 100,000 images of mathematical expressions paired with their corresponding LaTeX code.
Key Dataset Statistics:
- Total Images: 103,536
- Mean Dimensions: 319.2 × 61.2 px (width × height)
- Mean Aspect Ratio: 5.79
- Common Size: 320×64 px
- Image Format: RGB (uint8)
- Pixel Value Distribution: Mean 242.22, StdDev 45.70
Model Architecture
1. CNN-LSTM Architecture
The CNN-LSTM model consists of an encoder and a decoder (a minimal sketch follows this list):
- Encoder: A convolutional neural network with three convolutional blocks, each containing:
  - A Conv2D layer (32, 64, and 128 filters across the three blocks)
  - A ReLU activation
  - A MaxPooling layer
  The final output is flattened and passed through a dense layer to create the encoder embedding.
- Decoder: An LSTM-based decoder that:
  - Takes the encoder output and previously generated tokens as input
  - Generates output tokens one at a time
  - Uses teacher forcing during training (ground-truth tokens as input)
  - Offers an optional attention mechanism to focus on different parts of the encoder representation
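The sketch below illustrates how such an encoder/decoder pairing might look in PyTorch. The class names (ConvEncoder, LSTMDecoder), the embedding and hidden sizes, and the 64×320 input assumption are illustrative placeholders, not the project's actual code; the optional attention mechanism is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        blocks, in_ch = [], 1                       # grayscale input
        for out_ch in (32, 64, 128):                # three convolutional blocks
            blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        # a 64x320 input is pooled down to an 8x40 feature map
        self.fc = nn.Linear(128 * 8 * 40, embed_dim)

    def forward(self, images):                      # images: (B, 1, 64, 320)
        return self.fc(self.features(images).flatten(1))

class LSTMDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, enc, tokens):                 # teacher forcing: tokens are ground truth
        tok_emb = self.embed(tokens)                                  # (B, T, E)
        ctx = enc.unsqueeze(1).expand(-1, tokens.size(1), -1)         # repeat encoder output per step
        hidden, _ = self.lstm(torch.cat([tok_emb, ctx], dim=-1))
        return self.out(hidden)                                       # (B, T, vocab)
```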
2. ResNet-LSTM Architecture
The ResNet-LSTM model replaces the CNN encoder with a pre-trained ResNet (a sketch of the adaptation follows this list):
- Encoder: A pre-trained ResNet (ResNet18, ResNet34, ResNet50, ResNet101, or ResNet152) with:
  - The classification head removed
  - An option to freeze the backbone weights for transfer learning
  - A final layer adapted to produce embeddings of the desired dimension
- Decoder: The same LSTM-based decoder as in the CNN-LSTM model
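A minimal sketch of this encoder swap, using torchvision; the choice of ResNet50, the weights argument, the embedding size, and the class name are assumptions for illustration.

```python
import torch.nn as nn
from torchvision import models

class ResNetEncoder(nn.Module):
    def __init__(self, embed_dim=256, freeze=True):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        if freeze:
            for p in backbone.parameters():
                p.requires_grad = False          # transfer learning: keep backbone fixed
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()              # remove the classification head
        self.backbone = backbone
        self.project = nn.Linear(in_features, embed_dim)  # adapt to the embedding size

    def forward(self, images):                   # images: (B, 3, H, W)
        return self.project(self.backbone(images))        # (B, embed_dim)
```

The LSTM decoder from the previous sketch is reused unchanged; only the encoder module is swapped.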
Training Process
The training process implements several key strategies:
Optimization Setup
- Optimizer: Adam with configurable learning rate and weight decay
- Learning Rate Scheduling: ReduceLROnPlateau with patience 3, factor 0.5
- Loss Function: Cross-entropy with label smoothing (0.1)
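The setup described above might be wired up as in the following sketch, reusing the encoder and decoder modules from the earlier sketches; the learning rate and weight decay are placeholders for the configurable values.

```python
import torch
import torch.nn as nn

# parameters of the encoder and decoder sketched earlier
params = list(encoder.parameters()) + list(decoder.parameters())

optimizer = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)     # halve the LR after 3 stagnant epochs
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # label smoothing as described above

# after each validation pass: scheduler.step(val_loss)
```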
Training Techniques
- Teacher Forcing: A scheduled-sampling approach that gradually transitions from ground-truth tokens to the model's own predictions
- Gradient Clipping: Norm-based clipping (value: 5.0) to prevent exploding gradients
- Early Stopping: Training stops if validation metrics don't improve for 5 epochs
- Checkpointing: Regular saving of model checkpoints for resuming training
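The training-loop skeleton below shows how gradient clipping, checkpointing, and early stopping fit together; it reuses the encoder, decoder, optimizer, scheduler, and criterion from the earlier sketches, while train_loader, val_loader, validate, and max_epochs are placeholders (scheduled sampling is omitted for brevity).

```python
import torch

best_val_loss, stale_epochs = float("inf"), 0
for epoch in range(max_epochs):
    for images, tokens in train_loader:                    # tokens: <sos> ... <eos>
        optimizer.zero_grad()
        logits = decoder(encoder(images), tokens[:, :-1])  # teacher forcing
        loss = criterion(logits.transpose(1, 2), tokens[:, 1:])
        loss.backward()
        torch.nn.utils.clip_grad_norm_(                    # norm-based clipping at 5.0
            list(encoder.parameters()) + list(decoder.parameters()), max_norm=5.0)
        optimizer.step()

    val_loss = validate(encoder, decoder, val_loader)      # placeholder validation pass
    scheduler.step(val_loss)
    if val_loss < best_val_loss:
        best_val_loss, stale_epochs = val_loss, 0
        torch.save({"encoder": encoder.state_dict(),       # checkpoint the best model
                    "decoder": decoder.state_dict()}, "checkpoint.pt")
    else:
        stale_epochs += 1
        if stale_epochs >= 5:                              # early stopping after 5 epochs
            break
```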
Hardware Acceleration
- Device Support: CUDA for NVIDIA GPUs, MPS for Apple Silicon, CPU fallback
- Mixed Precision: FP16 computation where supported (30-40% faster training)
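The device fallback described above reduces to a simple check, sketched here:

```python
import torch

if torch.cuda.is_available():
    device = torch.device("cuda")          # NVIDIA GPUs
elif torch.backends.mps.is_available():
    device = torch.device("mps")           # Apple Silicon
else:
    device = torch.device("cpu")           # CPU fallback

# models and batches are then moved with .to(device); on supported devices,
# mixed precision can wrap the forward pass with
# torch.autocast(device_type=device.type, dtype=torch.float16).
```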
Results and Evaluation
The training process spanned 25 epochs, with the following progression in validation metrics for our best-performing model:
| Epoch | Loss   | Accuracy | BLEU   | Levenshtein |
|-------|--------|----------|--------|-------------|
| 1     | 2.2778 | 0.4986   | 0.0827 | 0.2311      |
| 5     | 1.8408 | 0.5760   | 0.1241 | 0.2609      |
| 10    | 1.6909 | 0.6022   | 0.1377 | 0.2716      |
| 15    | 1.6338 | 0.6116   | 0.1464 | 0.2781      |
| 20    | 1.6030 | 0.6180   | 0.1502 | 0.2799      |
| 25    | 1.5663 | 0.6256   | 0.1539 | 0.2829      |
The comparison between CNN-LSTM and ResNet-LSTM models showed:
- CNN-LSTM achieved 62.56% validation accuracy and a BLEU score of 0.1539
- ResNet50-LSTM achieved 59.42% accuracy and 0.1487 BLEU score in fewer epochs
- The CNN-LSTM architecture provided superior results with lower computational requirements
Implementation Details
Preprocessing Pipeline
The preprocessing pipeline includes:
- Grayscale conversion and normalization
- Resizing while maintaining aspect ratio
- Padding to consistent dimensions
- Data augmentation techniques including random rotations, translations, and scaling
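A sketch of the resize-and-pad steps is shown below. The 320×64 target comes from the dataset statistics above, while the white padding, top-left placement, and function name are assumptions; augmentation (random rotation, translation, scaling) is omitted.

```python
from PIL import Image
import numpy as np

def preprocess(path, target_w=320, target_h=64):
    img = Image.open(path).convert("L")                        # grayscale conversion
    scale = min(target_w / img.width, target_h / img.height)   # preserve aspect ratio
    img = img.resize((max(1, int(img.width * scale)),
                      max(1, int(img.height * scale))))
    canvas = Image.new("L", (target_w, target_h), color=255)   # white background
    canvas.paste(img, (0, 0))                                   # pad to the right and bottom
    return np.asarray(canvas, dtype=np.float32) / 255.0         # normalize to [0, 1]
```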
Vocabulary and Tokenization
A specialized tokenizer handles the LaTeX syntax:
- Special tokens for start/end of sequence, padding, and unknown tokens
- Support for common LaTeX commands and mathematical symbols
- Conversion between tokens and character-level representation
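A minimal tokenizer sketch showing the special tokens and an encode/decode round trip; the token inventory and splitting rule here are simplified placeholders for the project's LaTeX-aware tokenizer.

```python
import re

SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

class LatexTokenizer:
    # split into LaTeX commands (\frac, \alpha, ...), braces/scripts, or single characters
    TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|[{}^_]|\S")

    def __init__(self, vocab):
        self.itos = SPECIALS + sorted(vocab)
        self.stoi = {t: i for i, t in enumerate(self.itos)}

    def encode(self, latex):
        ids = [self.stoi.get(t, self.stoi["<unk>"]) for t in self.TOKEN_RE.findall(latex)]
        return [self.stoi["<sos>"]] + ids + [self.stoi["<eos>"]]

    def decode(self, ids):
        return " ".join(self.itos[i] for i in ids if self.itos[i] not in SPECIALS)

tok = LatexTokenizer({"\\frac", "x", "y", "{", "}", "2", "^"})
print(tok.decode(tok.encode(r"\frac{x}{y}^2")))   # \frac { x } { y } ^ 2
```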
Inference Methods
Multiple decoding strategies are supported (a greedy-decoding sketch follows this list):
- Greedy Search: Selects the highest-probability token at each step
- Beam Search: Maintains top-k hypotheses at each decoding step
- Sampling-based Methods: Temperature scaling, top-k sampling, and nucleus sampling
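The simplest of these, greedy decoding, is sketched below against the encoder/decoder interfaces assumed in the earlier sketches; the start/end token ids and the length limit are placeholders.

```python
import torch

@torch.no_grad()
def greedy_decode(encoder, decoder, image, sos_id=1, eos_id=2, max_len=150):
    enc = encoder(image.unsqueeze(0))                    # (1, embed_dim)
    tokens = torch.tensor([[sos_id]])
    for _ in range(max_len):
        logits = decoder(enc, tokens)[:, -1, :]          # distribution over the next token
        next_id = logits.argmax(dim=-1, keepdim=True)    # greedy: pick the highest probability
        tokens = torch.cat([tokens, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return tokens.squeeze(0).tolist()

# For sampling-based decoding, divide the logits by a temperature and draw with
# torch.multinomial (optionally restricting to the top-k or nucleus set) instead of argmax.
```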
Future Enhancements
Planned improvements include:
- Integration of transformer-based architectures (e.g., Vision Transformer encoder with BART decoder)
- Support for handwritten mathematical expressions
- Expansion to handle more complex LaTeX structures like tables and diagrams
- Web API deployment for broader accessibility
- Performance optimization for mobile devices
Technologies Used
- Python: Core implementation language
- PyTorch: Deep learning framework for model development
- NumPy & SciPy: Numerical computing and scientific functions
- Matplotlib & Seaborn: Visualization of results and metrics
- NLTK & SacreBLEU: Natural language evaluation metrics
- Hydra: Configuration management framework