Table Detection and Transformation Using TATR (Table Transform...

E2E Networks

Content Team @ E2E Networks

January 15, 2024·13 min read

Introduction

In document analysis, extracting structured data from unstructured documents remains a persistent difficulty, and the Table Transformer (TATR) is aimed squarely at it. Developed by Microsoft Research and available in the Hugging Face Transformers library, TATR adapts the DETR (DEtection TRansformer) object detection model to tables. Like DETR, it combines a convolutional backbone with an encoder-decoder Transformer, and it is trained for two tasks: detecting tables in documents and recognizing their structure.

Overview

The Table Transformer was introduced in the research paper ‘PubTables-1M: Towards comprehensive table extraction from unstructured documents’ by Brandon Smock, Rohith Pesala, and Robin Abraham. The paper also presents a new dataset, PubTables-1M, intended as a benchmark for progress in table extraction from unstructured documents, including table structure recognition and functional analysis. The authors trained two DETR models, which they call Table Transformers: one for table detection and another for table structure recognition.

What Is DETR?

The DETR model was introduced in the paper ‘End-to-End Object Detection with Transformers’ by Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. DETR consists of a convolutional backbone followed by an encoder-decoder Transformer, and it can be trained end-to-end for object detection. It avoids much of the complexity of models such as Faster R-CNN and Mask R-CNN, which rely on components like region proposals, non-maximum suppression, and anchor generation. DETR can also be extended to panoptic segmentation by adding a mask head on top of the decoder outputs.

Abstract of DETR

DETR frames object detection as a direct set prediction problem, a departure from conventional approaches. By streamlining the detection pipeline, it removes the need for hand-designed components such as non-maximum suppression and anchor generation, which explicitly encode prior knowledge about the task. The framework has two main ingredients: a set-based global loss that forces unique predictions through bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed, small set of learned object queries, DETR reasons about the relations between objects and the global image context to output the final set of predictions in parallel. The model is conceptually simple and, unlike many contemporary detectors, does not require a specialized library. On the challenging COCO object detection dataset, DETR achieves accuracy and runtime performance comparable to the well-established and highly optimized Faster R-CNN baseline. DETR can also be generalized in a straightforward way to panoptic segmentation, where it significantly outperforms competitive baselines.
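
The bipartite matching behind the set-based loss can be illustrated on a toy example. The sketch below is not DETR's implementation: it brute-forces the minimum-cost one-to-one assignment between predictions and ground-truth objects by trying every permutation, which is only feasible for tiny inputs (DETR itself relies on the Hungarian algorithm for this step). The cost values and the function name `match_predictions` are made up for illustration.

```python
from itertools import permutations


def match_predictions(cost):
    """Return the one-to-one assignment with the lowest total cost.

    cost[i][j] is the (hypothetical) cost of matching prediction i to
    ground-truth object j. Assumes at least as many predictions as objects.
    """
    num_preds, num_targets = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    # Try every way of picking one distinct prediction per target.
    for preds in permutations(range(num_preds), num_targets):
        total = sum(cost[p][t] for t, p in enumerate(preds))
        if total < best_total:
            best_total = total
            best_pairs = [(p, t) for t, p in enumerate(preds)]
    return sorted(best_pairs), best_total


# Three predictions, two ground-truth tables (illustrative costs only).
toy_cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.6],
]
pairs, total = match_predictions(toy_cost)
print(pairs, round(total, 2))
```

Here prediction 0 is matched to object 1 and prediction 1 to object 0. Prediction 2 is left unmatched, which in DETR corresponds to supervising it toward the ‘no object’ class.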

Abstract of TATR

The paper's abstract highlights recent progress in using machine learning for inferring and extracting table structures from unstructured documents. It acknowledges the challenge of creating large-scale datasets with accurate ground truth and introduces PubTables-1M as a solution. This dataset includes nearly one million tables from scientific articles, supporting various input formats and offering detailed header and location information for table structures. To enhance accuracy, it addresses ground truth inconsistencies observed in previous datasets by employing a novel canonicalization procedure. The research demonstrates the dataset's improvements in training and evaluating model performance for table structure recognition. Moreover, Transformer-based object detection models trained on PubTables-1M exhibit outstanding results across detection, structure recognition, and functional analysis without requiring specialized customization for these tasks.

Understanding the Table Transformer

At its core, DETR, originally designed for object detection and panoptic segmentation, uses a convolutional backbone such as ResNet-50 or ResNet-101 followed by an encoder-decoder Transformer. What sets it apart is its simplicity. Unlike Faster R-CNN or Mask R-CNN, which depend on mechanisms such as region proposals and anchor generation, DETR is trained end-to-end. This is made possible by its bipartite matching loss, so the model can be trained and fine-tuned in much the same way as a BERT model.

Advantages Over OCR

For a long time, Optical Character Recognition (OCR) has served as the conventional means of document analysis. Nevertheless, the Table Transformer offers distinct benefits:

  1. Structure Recognition: OCR is proficient at extracting text but does not recover how that text is laid out in a table. The Table Transformer recognizes and reconstructs the table structure (rows, columns, and headers), so the relational context of the data is retained.
  2. End-to-End Training: Traditional OCR pipelines often require several pre-processing steps and domain-specific adjustments. The Table Transformer is trained end-to-end, which reduces the need for complex preprocessing.
  3. Reduced Dependence on Templates: Template-based OCR extraction relies on predetermined layouts and handles variations in document layout poorly. The Table Transformer learns to detect tables from data, which makes it more robust across diverse document structures.

Tutorial - Using TATR on E2E Cloud

If you require extra GPU resources for the tutorial ahead, you can explore the offerings on E2E Cloud. E2E provides a diverse selection of GPUs, making it a suitable choice for more advanced LLM-based applications.

Make sure you add your SSH keys during launch, or through the security tab after launching.

Once you have launched a node, you can use VSCode Remote Explorer to SSH into the node and use it as a local development environment.

Tutorial: Table Transformer (TATR) for Table Detection and Extraction for OCR Application

Set Up the Environment

Let's start by installing Hugging Face Transformers and EasyOCR (an open-source OCR engine).

python
!pip install -q transformers easyocr

Next, we load a Table Transformer pre-trained for table detection. We use the ‘no_timm’ version here to load the checkpoint with a Transformers-native backbone.

python
from transformers import AutoModelForObjectDetection

model = AutoModelForObjectDetection.from_pretrained("microsoft/table-transformer-detection", revision="no_timm")
python
model.config.id2label

We move the model to a GPU if it's available (predictions will be faster).

python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
print("")

Next, we load an example image of a PDF page.

python
from PIL import Image
from huggingface_hub import hf_hub_download

# Loading an example image
file_path = hf_hub_download(repo_id="nielsr/example-pdf", repo_type="dataset", filename="image.png")
image = Image.open(file_path).convert("RGB")

# let's display it a bit smaller (display() is available in notebook environments)
width, height = image.size
display(image.resize((int(0.6*width), (int(0.6*height)))))

Preparing the image for the model can be done as follows:

python
from torchvision import transforms


class MaxResize(object):
    """Resize an image so that its longer side equals max_size."""

    def __init__(self, max_size=800):
        self.max_size = max_size

    def __call__(self, image):
        width, height = image.size
        current_max_size = max(width, height)
        scale = self.max_size / current_max_size
        resized_image = image.resize((int(round(scale*width)), int(round(scale*height))))
        return resized_image


detection_transform = transforms.Compose([
    MaxResize(800),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])
python
pixel_values = detection_transform(image).unsqueeze(0)
pixel_values = pixel_values.to(device)
print(pixel_values.shape)
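
To see what `MaxResize` does to an image's dimensions, the arithmetic can be reproduced without PIL. This sketch mirrors the scaling in `__call__` above; the helper name `max_resize_dims` and the example sizes are illustrative and not part of the original tutorial.

```python
def max_resize_dims(width, height, max_size=800):
    """Scale (width, height) so the longer side equals max_size."""
    scale = max_size / max(width, height)
    return int(round(scale * width)), int(round(scale * height))


portrait = max_resize_dims(1700, 2200)  # a portrait page is scaled down
small = max_resize_dims(400, 300)       # a small image is scaled up
print(portrait, small)
```

The aspect ratio is preserved in both cases. Note that images smaller than `max_size` are enlarged rather than left alone.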

Next, we forward the pixel values through the model. The model outputs logits of shape (batch_size, num_queries, num_labels + 1). The +1 is for the ‘no object’ class.

python
import torch

with torch.no_grad():
    outputs = model(pixel_values)
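
To make the role of the extra ‘no object’ class concrete, here is a small sketch in plain Python of how one query's logits turn into a label. The logit values and the `label_for_query` helper are invented for illustration; the label map follows the two detection classes used in this tutorial, with ‘no object’ as the last index.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_for_query(logits, id2label):
    """Return (label, score) for the highest-probability class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


toy_id2label = {0: "table", 1: "table rotated", 2: "no object"}
toy_logits = [
    [4.0, 0.5, 1.0],    # a query that is confident about "table"
    [-2.0, -1.5, 3.0],  # a background query
]
toy_predictions = [label_for_query(q, toy_id2label) for q in toy_logits]
print([label for label, _ in toy_predictions])
```

Most queries in a real forward pass behave like the second one and are discarded as ‘no object’.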

Next, we keep the predictions that have an actual class (i.e. not ‘no object’).

python
# for output bounding box post-processing
def box_cxcywh_to_xyxy(x):
    x_c, y_c, w, h = x.unbind(-1)
    b = [(x_c - 0.5 * w), (y_c - 0.5 * h), (x_c + 0.5 * w), (y_c + 0.5 * h)]
    return torch.stack(b, dim=1)


def rescale_bboxes(out_bbox, size):
    img_w, img_h = size
    b = box_cxcywh_to_xyxy(out_bbox)
    b = b * torch.tensor([img_w, img_h, img_w, img_h], dtype=torch.float32)
    return b


# update id2label to include "no object"
id2label = model.config.id2label
id2label[len(model.config.id2label)] = "no object"


def outputs_to_objects(outputs, img_size, id2label):
    # highest-scoring class (and its probability) for every query
    m = outputs.logits.softmax(-1).max(-1)
    pred_labels = list(m.indices.detach().cpu().numpy())[0]
    pred_scores = list(m.values.detach().cpu().numpy())[0]
    pred_bboxes = outputs['pred_boxes'].detach().cpu()[0]
    pred_bboxes = [elem.tolist() for elem in rescale_bboxes(pred_bboxes, img_size)]

    objects = []
    for label, score, bbox in zip(pred_labels, pred_scores, pred_bboxes):
        class_label = id2label[int(label)]
        if not class_label == 'no object':
            objects.append({'label': class_label, 'score': float(score),
                            'bbox': [float(elem) for elem in bbox]})

    return objects
python
objects = outputs_to_objects(outputs, image.size, id2label)
print(objects)
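
The box post-processing above can be checked by hand. DETR-style models predict boxes as normalized (center x, center y, width, height) values; the sketch below redoes the conversion to absolute corner coordinates for a single box in plain Python. The helper name `cxcywh_to_xyxy_abs` and the numbers are illustrative.

```python
def cxcywh_to_xyxy_abs(box, img_size):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    img_w, img_h = img_size
    return [
        (cx - 0.5 * w) * img_w,
        (cy - 0.5 * h) * img_h,
        (cx + 0.5 * w) * img_w,
        (cy + 0.5 * h) * img_h,
    ]


# A box centered in a 1000x800 image, half as wide and a quarter as tall.
toy_box = cxcywh_to_xyxy_abs((0.5, 0.5, 0.5, 0.25), (1000, 800))
print(toy_box)
```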

We can visualize the detection on the image.

python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from matplotlib.patches import Patch


def fig2img(fig):
    """Convert a Matplotlib figure to a PIL Image and return it"""
    import io
    buf = io.BytesIO()
    fig.savefig(buf)
    buf.seek(0)
    img = Image.open(buf)
    return img


def visualize_detected_tables(img, det_tables, out_path=None):
    plt.imshow(img, interpolation="lanczos")
    fig = plt.gcf()
    fig.set_size_inches(20, 20)
    ax = plt.gca()

    for det_table in det_tables:
        bbox = det_table['bbox']

        if det_table['label'] == 'table':
            facecolor = (1, 0, 0.45)
            edgecolor = (1, 0, 0.45)
            alpha = 0.3
            linewidth = 2
            hatch='//////'
        elif det_table['label'] == 'table rotated':
            facecolor = (0.95, 0.6, 0.1)
            edgecolor = (0.95, 0.6, 0.1)
            alpha = 0.3
            linewidth = 2
            hatch='//////'
        else:
            continue

        # three overlaid rectangles: light fill, outline, and hatching
        rect = patches.Rectangle(bbox[:2], bbox[2]-bbox[0], bbox[3]-bbox[1], linewidth=linewidth, edgecolor='none', facecolor=facecolor, alpha=0.1)
        ax.add_patch(rect)
        rect = patches.Rectangle(bbox[:2], bbox[2]-bbox[0], bbox[3]-bbox[1], linewidth=linewidth, edgecolor=edgecolor, facecolor='none',linestyle='-', alpha=alpha)
        ax.add_patch(rect)
        rect = patches.Rectangle(bbox[:2], bbox[2]-bbox[0], bbox[3]-bbox[1], linewidth=0, edgecolor=edgecolor, facecolor='none', linestyle='-', hatch=hatch, alpha=0.2)
        ax.add_patch(rect)

    plt.xticks([], [])
    plt.yticks([], [])

    legend_elements = [Patch(facecolor=(1, 0, 0.45), edgecolor=(1, 0, 0.45), label='Table', hatch='//////', alpha=0.3),
                       Patch(facecolor=(0.95, 0.6, 0.1), edgecolor=(0.95, 0.6, 0.1),
                             label='Table (rotated)', hatch='//////', alpha=0.3)]
    plt.legend(handles=legend_elements, bbox_to_anchor=(0.5, -0.02), loc='upper center', borderaxespad=0, fontsize=10, ncol=2)
    plt.gcf().set_size_inches(10, 10)
    plt.axis('off')

    if out_path is not None:
        plt.savefig(out_path, bbox_inches='tight', dpi=150)

    return fig


fig = visualize_detected_tables(image, objects)
visualized_image = fig2img(fig)

Next, we crop the table out of the image.

python
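def iob(bbox1, bbox2):
    """Intersection area over the area of bbox1 (boxes are [x0, y0, x1, y1])."""
    # This helper is not shown in the original article; it is a plain-Python
    # stand-in so that the token filter below is defined.
    inter_w = max(0, min(bbox1[2], bbox2[2]) - max(bbox1[0], bbox2[0]))
    inter_h = max(0, min(bbox1[3], bbox2[3]) - max(bbox1[1], bbox2[1]))
    area = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    return (inter_w * inter_h) / area if area > 0 else 0


def objects_to_crops(img, tokens, objects, class_thresholds, padding=10):
    table_crops = []
    for obj in objects:
        # skip detections below the confidence threshold for their class
        if obj['score'] < class_thresholds[obj['label']]:
            continue

        cropped_table = {}

        bbox = obj['bbox']
        bbox = [bbox[0]-padding, bbox[1]-padding, bbox[2]+padding, bbox[3]+padding]

        cropped_img = img.crop(bbox)

        # keep tokens that lie mostly inside the crop, and shift them into crop coordinates
        table_tokens = [token for token in tokens if iob(token['bbox'], bbox) >= 0.5]
        for token in table_tokens:
            token['bbox'] = [token['bbox'][0]-bbox[0],
                             token['bbox'][1]-bbox[1],
                             token['bbox'][2]-bbox[0],
                             token['bbox'][3]-bbox[1]]

        # if the table is predicted to be rotated, rotate the cropped image and tokens
        if obj['label'] == 'table rotated':
            cropped_img = cropped_img.rotate(270, expand=True)
            for token in table_tokens:
                bbox = token['bbox']
                bbox = [cropped_img.size[0]-bbox[3]-1,
                        bbox[0],
                        cropped_img.size[0]-bbox[1]-1,
                        bbox[2]]
                token['bbox'] = bbox

        cropped_table['image'] = cropped_img
        cropped_table['tokens'] = table_tokens

        table_crops.append(cropped_table)

    return table_crops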
python
tokens = []
detection_class_thresholds = {
    "table": 0.5,
    "table rotated": 0.5,
    "no object": 10
}
crop_padding = 10  # unused here: the call below passes padding=0

tables_crops = objects_to_crops(image, tokens, objects, detection_class_thresholds, padding=0)
cropped_table = tables_crops[0]['image'].convert("RGB")
cropped_table

cropped_table.save("table.jpg")

Next, we load a Table Transformer pre-trained for table structure recognition.

python
from transformers import TableTransformerForObjectDetection

# new v1.1 checkpoints require no timm anymore
structure_model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-structure-recognition-v1.1-all")
structure_model.to(device)
print("")

We prepare the cropped table image for the model, and perform a forward pass.

python
structure_transform = transforms.Compose([
    MaxResize(1000),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

pixel_values = structure_transform(cropped_table).unsqueeze(0)
pixel_values = pixel_values.to(device)
print(pixel_values.shape)

# forward pass
with torch.no_grad():
    outputs = structure_model(pixel_values)

Next, we get the predicted detections.

python
# update id2label to include "no object"
structure_id2label = structure_model.config.id2label
structure_id2label[len(structure_id2label)] = "no object"

cells = outputs_to_objects(outputs, cropped_table.size, structure_id2label)
print(cells)
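
The structure model's detections are used here without any score filtering. If spurious rows or columns show up, one option is to apply per-class thresholds, as is done for the table crops earlier. A minimal sketch, where the helper name `filter_by_score`, the threshold values, and the toy detections are all assumptions rather than part of the original tutorial:

```python
def filter_by_score(objects, thresholds, default=0.5):
    """Keep detections whose score meets the threshold for their label."""
    return [
        obj for obj in objects
        if obj["score"] >= thresholds.get(obj["label"], default)
    ]


# Toy detections in the same dict format that outputs_to_objects returns.
toy_cells = [
    {"label": "table row", "score": 0.98, "bbox": [0, 0, 100, 20]},
    {"label": "table row", "score": 0.31, "bbox": [0, 18, 100, 24]},
    {"label": "table column", "score": 0.87, "bbox": [0, 0, 40, 60]},
]
kept = filter_by_score(toy_cells, {"table row": 0.6, "table column": 0.6})
print([(c["label"], c["score"]) for c in kept])
```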

We can visualize all recognized cells using PIL's ImageDraw module.

python
from PIL import ImageDraw

cropped_table_visualized = cropped_table.copy()
draw = ImageDraw.Draw(cropped_table_visualized)

for cell in cells:
    draw.rectangle(cell["bbox"], outline="red")

cropped_table_visualized

An alternative way of plotting is to select one class to visualize, like ‘table row’:

python
def plot_results(cells, class_to_visualize):
    if class_to_visualize not in structure_model.config.id2label.values():
        raise ValueError("Class should be one of the available classes")

    plt.figure(figsize=(16,10))
    plt.imshow(cropped_table)
    ax = plt.gca()

    for cell in cells:
        score = cell["score"]
        bbox = cell["bbox"]
        label = cell["label"]

        # draw a box and a score label for each detection of the chosen class
        if label == class_to_visualize:
            xmin, ymin, xmax, ymax = tuple(bbox)

            ax.add_patch(plt.Rectangle((xmin, ymin), xmax - xmin, ymax - ymin, fill=False, color="red", linewidth=3))
            text = f'{cell["label"]}: {score:0.2f}'
            ax.text(xmin, ymin, text, fontsize=15, bbox=dict(facecolor='yellow', alpha=0.5))

    plt.axis('off')


plot_results(cells, class_to_visualize="table row")

Apply OCR Row by Row

First, we get the coordinates of the individual cells, row by row, by looking at the intersection of the rows and columns. Next, we apply OCR on each individual cell, row-by-row.

Alternatively, one could also do OCR column by column, and so on.

python
def get_cell_coordinates_by_row(table_data):
    # Extract rows and columns
    rows = [entry for entry in table_data if entry['label'] == 'table row']
    columns = [entry for entry in table_data if entry['label'] == 'table column']

    # Sort rows and columns by their Y and X coordinates, respectively
    rows.sort(key=lambda x: x['bbox'][1])
    columns.sort(key=lambda x: x['bbox'][0])

    # Function to find cell coordinates
    def find_cell_coordinates(row, column):
        cell_bbox = [column['bbox'][0], row['bbox'][1], column['bbox'][2], row['bbox'][3]]
        return cell_bbox

    # Generate cell coordinates and count cells in each row
    cell_coordinates = []

    for row in rows:
        row_cells = []
        for column in columns:
            cell_bbox = find_cell_coordinates(row, column)
            row_cells.append({'column': column['bbox'], 'cell': cell_bbox})

        # Sort cells in the row by X coordinate
        row_cells.sort(key=lambda x: x['column'][0])

        # Append row information to cell_coordinates
        cell_coordinates.append({'row': row['bbox'], 'cells': row_cells, 'cell_count': len(row_cells)})

    # Sort rows from top to bottom
    cell_coordinates.sort(key=lambda x: x['row'][1])

    return cell_coordinates


cell_coordinates = get_cell_coordinates_by_row(cells)
python
len(cell_coordinates)

len(cell_coordinates[0]["cells"])

for row in cell_coordinates:
    print(row["cells"])
python
import numpy as np
import csv
import easyocr
from tqdm.auto import tqdm

reader = easyocr.Reader(['en'])  # this needs to run only once to load the model into memory


def apply_ocr(cell_coordinates):
    # let's OCR row by row
    data = dict()
    max_num_columns = 0
    for idx, row in enumerate(tqdm(cell_coordinates)):
        row_text = []
        for cell in row["cells"]:
            # crop cell out of image
            cell_image = np.array(cropped_table.crop(cell["cell"]))
            # apply OCR
            result = reader.readtext(cell_image)
            # cells with no detected text are skipped, so later cells in the row shift left
            if len(result) > 0:
                # print([x[1] for x in list(result)])
                text = " ".join([x[1] for x in result])
                row_text.append(text)

        if len(row_text) > max_num_columns:
            max_num_columns = len(row_text)

        data[idx] = row_text

    print("Max number of columns:", max_num_columns)

    # pad rows which don't have max_num_columns elements
    # to make sure all rows have the same number of columns
    for row, row_data in data.copy().items():
        if len(row_data) != max_num_columns:
            row_data = row_data + ["" for _ in range(max_num_columns - len(row_data))]
        data[row] = row_data

    return data


data = apply_ocr(cell_coordinates)

for row, row_data in data.items():
    print(row_data)

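The result is a dictionary mapping each row index to its list of cell texts, which can be written to a CSV file with Python's csv module (imported above).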
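
The step that writes the rows out is not shown in the code above. A minimal sketch of it using the standard-library `csv` module might look like the following; the helper name `rows_to_csv` and the toy data are assumptions, with the toy dict standing in for the `data` returned by `apply_ocr`.

```python
import csv
import io


def rows_to_csv(data, fileobj):
    """Write a {row_index: [cell_text, ...]} mapping as CSV rows, in index order."""
    writer = csv.writer(fileobj)
    for idx in sorted(data):
        writer.writerow(data[idx])


# Toy stand-in for the `data` dict returned by apply_ocr.
toy_data = {0: ["Name", "Score"], 1: ["Alice", "0.91"], 2: ["Bob", ""]}
buffer = io.StringIO()
rows_to_csv(toy_data, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open a file with `newline=""` (as the `csv` documentation recommends) and pass it in, for example `with open("output.csv", "w", newline="") as f: rows_to_csv(data, f)`, where the file name is only a placeholder.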

Conclusion

In summary, the Table Transformer is a significant advance in document analysis, particularly for PDFs containing intricate tables. Paired with an OCR engine, it turns page images into structured rows and columns.

Built on the DETR framework, the model identifies tables on a page and recovers their rows, columns, and headers, so that the text read by OCR can be placed back into its tabular structure. By combining a convolutional backbone with an encoder-decoder Transformer, it performs well at both detecting tables and recognizing their structure.

Compared with an OCR-only workflow, it offers several advantages: it can discern and rebuild table layouts, its end-to-end training simplifies workflows, and it reduces reliance on inflexible templates. Its adaptability to varied document structures is a key strength.

As document analysis progresses, the Table Transformer offers a more efficient, precise, and streamlined way to extract structured information from the large collections of PDFs that contain valuable tabular data. It also opens up broader applications in domains that depend on comprehensive information extraction from documents.

References

TATR Repo: https://github.com/microsoft/table-transformer

Research Paper: PubTables-1M: Towards comprehensive table extraction from unstructured documents

Research Paper: End-to-End Object Detection with Transformers
