
Input data #7

Open
greav opened this issue Feb 20, 2020 · 32 comments

Comments

@greav

greav commented Feb 20, 2020

Hello,
Could you provide the input data for the model to reproduce the results, or at least the input data format, so that I can try the model on my custom dataset?

@ziodos

ziodos commented Feb 21, 2020

Hello,
I couldn't find the exact input data to train the model on the ICDAR dataset. Can you provide an explanation of it?
Thanks.

@vsymbol
Owner

vsymbol commented Mar 11, 2020

Hello,
Could you provide the input data for the model to reproduce the results, or at least the input data format, so that I can try the model on my custom dataset?

The project was refreshed with all history removed. All programs are runnable, except that the data example is not uploaded.

You may infer the correct data format from the data_loader_json.py file. Pull requests for making the project runnable out of the box are welcome. I'll let you know when the original data format can be provided; otherwise, please feel free to create a pull request.

@4kssoft
Contributor

4kssoft commented Mar 14, 2020

Could you tell me whether this is the correct data format?

Format:

file_name.json

{
  "global_attributes": {
    "file_id": "$file_name"
  },
  "fields":[
    {
      "field_name": "$class_name",
      "key_id": [],
      "key_text": [],
      "value_id": [$word_id],
      "value_text": "$word_text"
    },...
  ],

  "text_boxes":[
    {
      "id": $word_id,
      "bbox": [$word_x_min, $word_y_min, $word_x_max, $word_y_max],
      "text": "$word_text"
    },...
  ]
}

Example:

file1.json

{
  "global_attributes": {
    "file_id": "file1.jpg"
  },
  "fields":[
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [1],
      "value_text": "sample1"
    },
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [2],
      "value_text": "sample2"
    },
    {
      "field_name": "class2",
      "key_id": [],
      "key_text": [],
      "value_id": [3],
      "value_text": "sample3"
    }
  ],

  "text_boxes":[
    {
      "id": 1,
      "bbox": [10, 10, 50, 20],
      "text": "sample1"
    },
    {
      "id": 2,
      "bbox": [55, 10, 100, 20],
      "text": "sample2"
    },
    {
      "id": 3,
      "bbox": [50, 30, 100, 40],
      "text": "sample3"
    }
  ]
}

Or perhaps the correct format should look like this:
Example 2:

file1.json

{
  "global_attributes": {
    "file_id": "file1.jpg"
  },
  "fields":[
    {
      "field_name": "class1",
      "key_id": [],
      "key_text": [],
      "value_id": [1, 2],
      "value_text": ["sample1", "sample2"]
    },
    {
      "field_name": "class2",
      "key_id": [],
      "key_text": [],
      "value_id": [3],
      "value_text": ["sample3"]
    }
  ],

  "text_boxes":[
    {
      "id": 1,
      "bbox": [10, 10, 50, 20],
      "text": "sample1"
    },
    {
      "id": 2,
      "bbox": [55, 10, 100, 20],
      "text": "sample2"
    },
    {
      "id": 3,
      "bbox": [50, 30, 100, 40],
      "text": "sample3"
    }
  ]
}
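
For reference, here is a minimal Python sketch of a writer for this label format, assuming the second variant above (value_text as a list); the write_label_file helper is hypothetical, and the exact schema should be verified against data_loader_json.py before relying on it.

import json

# Hypothetical helper: writes one CUTIE-style label file per document image.
# Assumes the second variant (value_text as a list); verify against
# data_loader_json.py.
def write_label_file(out_path, image_name, words, fields):
    # words: list of (word_id, (x_min, y_min, x_max, y_max), text)
    # fields: list of (field_name, [word_ids])
    id_to_text = {wid: text for wid, _, text in words}
    doc = {
        "global_attributes": {"file_id": image_name},
        "fields": [
            {
                "field_name": name,
                "key_id": [],
                "key_text": [],
                "value_id": ids,
                "value_text": [id_to_text[i] for i in ids],
            }
            for name, ids in fields
        ],
        "text_boxes": [
            {"id": wid, "bbox": list(bbox), "text": text}
            for wid, bbox, text in words
        ],
    }
    with open(out_path, "w") as f:
        json.dump(doc, f, indent=2)

# Reproduces file1.json from Example 2:
write_label_file(
    "file1.json", "file1.jpg",
    words=[(1, (10, 10, 50, 20), "sample1"),
           (2, (55, 10, 100, 20), "sample2"),
           (3, (50, 30, 100, 40), "sample3")],
    fields=[("class1", [1, 2]), ("class2", [3])],
)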

@langheran

Hello @vsymbol, http://52.193.30.103 seems to be down. Could you provide the updated link?

@varshaneya

varshaneya commented Jun 16, 2020

Hello @vsymbol , http://52.193.30.103 seems to be down. Could you provide the updated link?

I too am unable to reach 52.193.30.103, even via ping. Can you confirm whether it is up?

@4kssoft
Contributor

4kssoft commented Jun 16, 2020

Sample data file #8 (comment)

@vsymbol
Owner

vsymbol commented Jun 16, 2020

Hello @vsymbol, http://52.193.30.103 seems to be down. Could you provide the updated link?

I too am unable to reach 52.193.30.103, even via ping. Can you confirm whether it is up?

Hi varshaneya, the link is down.

@samhita-alla

Hi,

Can you please provide details on how to generate and test the model? There are a lot of files and command-line arguments involved. Could you please update the README to explain how the model has to be trained and tested?

@Ibmaria

Ibmaria commented Jun 27, 2020

Hello @vsymbol,
How do you get this format? I have no idea. Could you explain it to me, please?
Thanks in advance

@4kssoft
Contributor

4kssoft commented Jun 29, 2020

@Ibmaria
Hello 4kssoft,
How do you get this format? I have no idea. Could you explain it to me, please?
Thanks in advance

To get the format, I analyzed this file: https://github.com/vsymbol/CUTIE/blob/master/data_loader_json.py

@samhita-alla

I'm training a model with these parameters:

python main_train_json.py \
  --doc_path 'invoice_data/' \
  --save_prefix 'INVOICE' \
  --test_path '' \
  --embedding_file '' \
  --ckpt_path 'graph/' \
  --ckpt_file 'CUTIE_highresolution_8x_d20000c9(r80c80_iter_40000.ckpt' \
  --tokenize True \
  --update_dict True \
  --dict_path 'dict/' \
  --rows_segment 72 \
  --cols_segment 72 \
  --augment_strategy 1 \
  --positional_mapping_strategy 1 \
  --rows_target 64 \
  --cols_target 64 \
  --rows_ulimit 80 \
  --fill_bbox False \
  --data_augmentation_extra True \
  --data_augmentation_dropout 1 \
  --data_augmentation_extra_rows 16 \
  --data_augmentation_extra_cols 16 \
  --batch_size 32 \
  --iterations 40000 \
  --lr_decay_step 13000 \
  --learning_rate 0.0001 \
  --lr_decay_factor 0.1 \
  --hard_negative_ratio 3 \
  --use_ghm 0 \
  --ghm_bins 30 \
  --ghm_momentum 0 \
  --log_path 'log/' \
  --log_disp_step 100 \
  --log_save_step 100 \
  --validation_step 100 \
  --test_step 400 \
  --ckpt_save_step 50 \
  --embedding_size 128 \
  --weight_decay 0.0005 \
  --eps 1e-6

@Ibmaria

Ibmaria commented Jun 30, 2020

@4kssoft
Thanks a lot for sharing how to train the model. However, which engine (API) did you use to get the box coordinates from the images?
Thanks

@4kssoft
Contributor

4kssoft commented Jun 30, 2020

@4kssoft
Thanks a lot for sharing how to train the model. However, which engine (API) did you use to get the box coordinates from the images?
Thanks

I use my own software for labeling documents (https://www.youtube.com/watch?v=1okRMNxC0ec)

@Ibmaria

Ibmaria commented Jul 1, 2020

@4kssoft
Thank you!

@gandalf012

gandalf012 commented Jul 4, 2020

@4kssoft

I use my own software for labeling documents (https://www.youtube.com/watch?v=1okRMNxC0ec)

Thanks for sharing. How can I access your tool?

@4kssoft
Contributor

4kssoft commented Jul 6, 2020

@4kssoft Thanks for sharing. How can I access your tool?

This is a beta version for now. I plan to publish this software, but not as open source.

@varshaneya

I'm training a model with these parameters: (training command quoted in full above)

Could you please provide the ckpt file CUTIE_highresolution_8x_d20000c9(r80c80_iter_40000.ckpt and the invoice dataset that you used for training?

@sathvikask0

I'm training a model with these parameters: (training command quoted in full above)

@4kssoft If possible, please provide the pretrained model that you are using!

And for annotation with bounding boxes, this link might be useful:
Tesseract OCR: Text localization and detection

@Hrishkesh

Hi @4kssoft, I have my own data: I extracted the text using Tesseract OCR and got the position of each word. Could you let me know how to convert this into the format of the example in your repository, the sample PDF file Faktura1.pdf_0.json?

@Neelesh1121

@4kssoft Thanks for your suggestions. I have generated my own training dataset and am able to train the model, but I don't understand what the input format should be to predict results. If you know what modifications are required to get predictions, please let us know.

@4kssoft
Contributor

4kssoft commented Jul 31, 2020

Hello all

@4kssoft If possible, please provide the pretrained model that you are using!

@sathvikask0
Sorry, but unfortunately I cannot share my model.

Hi @4kssoft, I have my own data: I extracted the text using Tesseract OCR and got the position of each word. Could you let me know how to convert this into the format of the example in your repository, the sample PDF file Faktura1.pdf_0.json?

@Hrishkesh
As I wrote earlier in #7 (comment), I use my own tool to annotate documents; I also have ready-made functions for exporting training data to various models.
I'm planning to publish a beta version of my solution soon.

@4kssoft Thanks for your suggestions. I have generated my own training dataset and am able to train the model, but I don't understand what the input format should be to predict results. If you know what modifications are required to get predictions, please let us know.

@Neelesh1121
The format is the same as for training. See the main_evaluate_json.py script: https://github.com/vsymbol/CUTIE/blob/master/main_evaluate_json.py

@vishal7894

Every time I try to use main_evaluate_json.py, I get the error below.

@4kssoft @samhita-alla @vsymbol
Can anyone please help?

2 root error(s) found.
(0) Not found: Key feature_fuser/biases not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Not found: Key feature_fuser/biases not found in checkpoint
[[node save/RestoreV2 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[save/RestoreV2/_49]]
0 successful operations.
0 derived errors ignored.

@n0ct4li

n0ct4li commented Sep 28, 2020

I'm training a model with these parameters: (training command quoted in full above)

@4kssoft Do you generate your own dictionary? I don't really understand the part "Generate your own dictionary with main_build_dict.py / main_data_tokenizer.py". Can you explain how to apply this process to my own dataset? Thanks.

Also, what does the ckpt_path argument refer to?

@Karthik1904

Hello @vsymbol

Could you please give a brief explanation of how to generate the texts and their corresponding bounding boxes, and of how to manually label each text and its bounding box?

Which tools should we use for manual labelling?

@mohammedayub44

@4kssoft Thanks for the labeling video. Does your software export in the format required by CUTIE (the JSON template you provided), or do you have to run explicit post-processing?
In the JSON example you provided, what do "key_id" and "key_text" represent? All of them look empty.

@vsymbol
Owner

vsymbol commented Oct 22, 2020

Hello @vsymbol

Could you please give a brief explanation of how to generate the texts and their corresponding bounding boxes, and of how to manually label each text and its bounding box?

Which tools should we use for manual labelling?

Apply any OCR tool that helps you detect and recognize words in the scanned document image.
For example, refer to what @4kssoft did: the document image was processed into a .json file containing the position and text of each word.
[screenshot: annotated invoice document]
https://github.com/4kssoft/CUTIE/blob/master/invoice_data/Faktura1.pdf_0.json
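
For the OCR step, here is a minimal Python sketch using pytesseract; pytesseract is an assumption (any engine that returns word boxes will do), the image_to_label_stub helper is hypothetical, and the "fields" entries still have to be labelled manually.

import json
import pytesseract
from pytesseract import Output
from PIL import Image

# Hypothetical helper: runs Tesseract on one image and emits only the
# "text_boxes" part of the label format; "fields" is left empty for
# manual labelling with a tool of your choice.
def image_to_label_stub(image_path, out_path):
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)
    text_boxes, word_id = [], 1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty tokens (page/block/line rows)
        x, y = data["left"][i], data["top"][i]
        w, h = data["width"][i], data["height"][i]
        text_boxes.append({
            "id": word_id,
            "bbox": [x, y, x + w, y + h],  # [x_min, y_min, x_max, y_max]
            "text": text,
        })
        word_id += 1
    doc = {
        "global_attributes": {"file_id": image_path},
        "fields": [],  # fill in manually: field_name, value_id, value_text
        "text_boxes": text_boxes,
    }
    with open(out_path, "w") as f:
        json.dump(doc, f, indent=2)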

@hhien

hhien commented Dec 1, 2020

@Hrishkesh , @sathvikask0, @Karthik1904
I have written a simple script that runs Tesseract OCR and outputs a JSON file in the same format as the invoice_data/ example:
https://github.com/hhien/tesseract_applications.git

@shrivastavapankajj

Was anyone able to train and test the model? I couldn't find how to predict on new data.

@hhien

hhien commented Feb 14, 2021 via email

@fj0n

fj0n commented Apr 2, 2021

Was anyone able to train and test the model? I couldn't find how to predict on new data.

I'm struggling with it. So far I have been able to create the .json files using hhien's code.

If anyone has succeeded, I would be grateful for any recommendations on how to train the model.

@darsh169

darsh169 commented Jun 5, 2021

Could you tell me whether this is the correct data format? (format template and examples quoted in full above)

What are the bbox entries: x1, y1, width, height, or x1, y1 (top left), x2, y2 (bottom right)?

@gibotsgithub

I have created the JSON files in the required format, and I have data for 400 invoices. main_train_json.py gets killed because it uses all the RAM. Has anyone faced this issue? I have 16 GB of RAM.

@Ajithbalakrishnan

Could anyone please share the inference script?
