This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

DeepLearning on Imagenet with mxnet issues translating .lst to .rec files #9766

Closed
stonedl3 opened this issue Feb 11, 2018 · 6 comments

Comments

stonedl3 commented Feb 11, 2018

Description

I ran the following command and got double the expected output:
$ ~/mxnet/bin/im2rec imagenet/lists/train.lst "" \
    imagenet/rec/train.rec \
    resize=256 encoding='.jpg' \
    quality=100
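
Because the command above lost its line breaks when it was pasted, it is hard to tell exactly which arguments reached `im2rec`. One way to rule out shell-quoting problems is to build the argument list explicitly. This is a minimal sketch, assuming the tool's `<image.lst> <image_root> <output.rec> key=value ...` calling convention; the helper name `build_im2rec_argv` is hypothetical and not part of MXNet.

```python
import os
import shlex


def build_im2rec_argv(binary, lst_path, image_root, rec_path, **options):
    """Return an argv list for im2rec with key=value options appended.

    Hypothetical helper: it only assembles strings and never runs anything.
    """
    argv = [os.path.expanduser(binary), lst_path, image_root, rec_path]
    # Sort the keys so the resulting command line is deterministic.
    for key in sorted(options):
        argv.append("{}={}".format(key, options[key]))
    return argv


argv = build_im2rec_argv(
    "~/mxnet/bin/im2rec",
    "imagenet/lists/train.lst",
    "",  # empty image root, as in the original command
    "imagenet/rec/train.rec",
    resize=256,
    encoding=".jpg",
    quality=100,
)

# Print a copy-pasteable, correctly quoted version of the command.
print(" ".join(shlex.quote(arg) for arg in argv))

# To actually run the conversion (not executed here):
#   import subprocess; subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` bypasses the shell entirely, so no option can pick up a stray leading space or typographic quotes on the way to the tool.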

The output I got from running the command
$ ls -l imagenet/rec/
total 217313924
-rw-rw-r-- 1 stonedl3 stonedl3 8310150340 Feb 4 01:27 test.rec
-rw-rw-r-- 1 stonedl3 stonedl3 205306062916 Feb 4 00:46 train.rec
-rw-rw-r-- 1 stonedl3 stonedl3 8913201356 Feb 4 01:10 val.rec

I then tried to train alexnet and got an error
[10:57:18] /home/stonedl3/mxnet/dmlc-core/include/dmlc/./logging.h:308: [10:57:18] src/io/image_aug_default.cc:300: Check failed: static_cast<index_t>(res.rows) >= param_.data_shape[1] && static_cast<index_t>(res.cols) >= param_.data_shape[2] input image size smaller than input shape
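
The failed check in the log compares each decoded image's rows and columns against `param_.data_shape[1]` and `param_.data_shape[2]`. The sketch below mirrors that condition in plain Python to show which images would trip it. The function name `fits_data_shape` is hypothetical, and the `(3, 224, 224)` shape is the `data_shape` from the training script posted later in this thread.

```python
def fits_data_shape(rows, cols, data_shape):
    """Mirror the failing check: the image must be at least as large as the crop.

    data_shape is (channels, height, width), as passed to ImageRecordIter.
    """
    _, height, width = data_shape
    return rows >= height and cols >= width


data_shape = (3, 224, 224)

# An image whose shorter edge is 256 passes the check.
print(fits_data_shape(256, 341, data_shape))  # True

# An image that is only 200 pixels tall fails it, however wide it is.
print(fits_data_shape(200, 500, data_shape))  # False
```

A single undersized record in the `.rec` file is enough to abort the iterator, since the check runs per image.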

Environment info (Required)

  • The machine I am using to train ImageNet is an HP Omen desktop with the following specs:
    Graphics cards: two NVIDIA GTX 1080 Ti GPUs
    Memory: 31.9 GB
    Processor: Intel® Core™ i7-7700K CPU @ 4.20GHz × 8
    OS type: 64-bit Ubuntu 16.04
    Hard disk: 1.9 TB

I configured my environment using instructions from PyImageSearch.

I downloaded and installed the latest version of mxnet

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
(dl4cv) stonedl3@stonedl3:~$ python diagnose.py
----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 23 2017 16:37:01')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 9.0.1
Directory    : /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version      : 0.11.0
Directory    : /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.13.0-32-generic-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : stonedl3
release      : 4.13.0-32-generic
version      : #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Stepping:              9
CPU MHz:               4200.000
CPU max MHz:           4500.0000
CPU min MHz:           800.0000
BogoMIPS:              8400.00
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
----------Network Test----------
Setting timeout: 10
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0877 sec, LOAD: 0.1309 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0910 sec, LOAD: 0.2431 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0347 sec, LOAD: 0.5884 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0481 sec, LOAD: 0.1984 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0695 sec, LOAD: 0.5899 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1626 sec, LOAD: 0.9258 sec.


Package used (Python/R/Scala/Julia):
I am running python 3.5 in a virtual environment

## Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):

MXNet commit hash:
(Paste the output of `git rev-parse HEAD` here.) no hash

@anirudh2290
Member

Can you please also provide the script that you ran to reproduce the issue? I see that you are using an older version of MXNet: 0.11.0. Have you tried 1.0.0?

@stonedl3
Author

stonedl3 commented Feb 14, 2018 via email

@stonedl3
Author

Here is the script used to train AlexNet:

# USAGE
# python train_alexnet.py --checkpoints checkpoints --prefix alexnet
# python train_alexnet.py --checkpoints checkpoints --prefix alexnet --start-epoch 25

# import the necessary packages
from config import imagenet_alexnet_config as config
from pyimagesearch.nn.mxconv import MxAlexNet
import mxnet as mx
import argparse
import logging
import json
import os

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-c", "--checkpoints", required=True,
    help="path to output checkpoint directory")
ap.add_argument("-p", "--prefix", required=True,
    help="name of model prefix")
ap.add_argument("-s", "--start-epoch", type=int, default=0,
    help="epoch to restart training at")
args = vars(ap.parse_args())

# set the logging level and output file
logging.basicConfig(level=logging.DEBUG,
    filename="training_{}.log".format(args["start_epoch"]),
    filemode="w")

# load the RGB means for the training set, then determine the batch
# size
means = json.loads(open(config.DATASET_MEAN).read())
batchSize = config.BATCH_SIZE * config.NUM_DEVICES

# construct the training image iterator
trainIter = mx.io.ImageRecordIter(
    path_imgrec=config.TRAIN_MX_REC,
    data_shape=(3, 224, 224),
    batch_size=batchSize,
    rand_crop=True,
    rand_mirror=True,
    rotate=15,
    max_shear_ratio=0.1,
    mean_r=means["R"],
    mean_g=means["G"],
    mean_b=means["B"],
    preprocess_threads=config.NUM_DEVICES * 2)

# construct the validation image iterator
valIter = mx.io.ImageRecordIter(
    path_imgrec=config.VAL_MX_REC,
    data_shape=(3, 224, 224),
    batch_size=batchSize,
    mean_r=means["R"],
    mean_g=means["G"],
    mean_b=means["B"])

# initialize the optimizer
opt = mx.optimizer.SGD(learning_rate=1e-2, momentum=0.9, wd=0.0005,
    rescale_grad=1.0 / batchSize)

# construct the checkpoints path, initialize the model argument and
# auxiliary parameters
checkpointsPath = os.path.sep.join([args["checkpoints"],
    args["prefix"]])
argParams = None
auxParams = None

# if there is no specific model starting epoch supplied, then
# initialize the network
if args["start_epoch"] <= 0:
    # build the AlexNet architecture
    print("[INFO] building network...")
    model = MxAlexNet.build(config.NUM_CLASSES)

# otherwise, a specific checkpoint was supplied
else:
    # load the checkpoint from disk
    print("[INFO] loading epoch {}...".format(args["start_epoch"]))
    model = mx.model.FeedForward.load(checkpointsPath,
        args["start_epoch"])

    # update the model and parameters
    argParams = model.arg_params
    auxParams = model.aux_params
    model = model.symbol

# compile the model
model = mx.mod.Module(
    context=[mx.gpu(0), mx.gpu(1)],
    symbol=model)

# initialize the callbacks and evaluation metrics
batchEndCBs = [mx.callback.Speedometer(batchSize, 500)]
epochEndCBs = [mx.callback.do_checkpoint(checkpointsPath)]
metrics = [mx.metric.Accuracy(), mx.metric.TopKAccuracy(top_k=5),
    mx.metric.CrossEntropy()]

# train the network
print("[INFO] training network...")
model.fit(
    train_data=trainIter,
    eval_data=valIter,
    eval_metric=metrics,
    batch_end_callback=batchEndCBs,
    epoch_end_callback=epochEndCBs,
    initializer=mx.initializer.Xavier(),
    arg_params=argParams,
    aux_params=auxParams,
    optimizer=opt,
    num_epoch=65,
    begin_epoch=args["start_epoch"])

@stonedl3
Author

Here is the complete error trace:
$ python train_alexnet.py --checkpoints checkpoints --prefix alexnet
[22:21:05] src/io/iter_image_recordio_2.cc:153: ImageRecordIOParser2: /home/stonedl3/dl4cv/IB_Code/datasets/imagenet/rec/train.rec, use 3 threads for decoding..
[22:21:05] /home/stonedl3/mxnet/dmlc-core/include/dmlc/./logging.h:308: [22:21:05] src/io/image_aug_default.cc:300: Check failed: static_cast<index_t>(res.rows) >= param_.data_shape[1] && static_cast<index_t>(res.cols) >= param_.data_shape[2] input image size smaller than input shape

Stack trace returned 6 entries:
[bt] (0) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc3cb0d3b1c]
[bt] (1) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(_ZN5mxnet2io21DefaultImageAugmenter7ProcessERKN2cv3MatEPSt6vectorIfSaIfEEPSt23mersenne_twister_engineImLm32ELm624ELm397ELm31ELm2567483615ELm11ELm4294967295ELm7ELm2636928640ELm15ELm4022730752ELm18ELm1812433253EE+0x10d8) [0x7fc3cbca3958]
[bt] (2) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(+0x12f5e4f) [0x7fc3cbd12e4f]
[bt] (3) /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0xf43e) [0x7fc3e473643e]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc3e95e76ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc3e931d41d]

terminate called after throwing an instance of 'dmlc::Error'
what(): [22:21:05] src/io/image_aug_default.cc:300: Check failed: static_cast<index_t>(res.rows) >= param_.data_shape[1] && static_cast<index_t>(res.cols) >= param_.data_shape[2] input image size smaller than input shape

Stack trace returned 6 entries:
[bt] (0) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7fc3cb0d3b1c]
[bt] (1) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(_ZN5mxnet2io21DefaultImageAugmenter7ProcessERKN2cv3MatEPSt6vectorIfSaIfEEPSt23mersenne_twister_engineImLm32ELm624ELm397ELm31ELm2567483615ELm11ELm4294967295ELm7ELm2636928640ELm15ELm4022730752ELm18ELm1812433253EE+0x10d8) [0x7fc3cbca3958]
[bt] (2) /home/stonedl3/.virtualenvs/dl4cv/lib/python3.5/site-packages/mxnet/../../lib/libmxnet.so(+0x12f5e4f) [0x7fc3cbd12e4f]
[bt] (3) /usr/lib/x86_64-linux-gnu/libgomp.so.1(+0xf43e) [0x7fc3e473643e]
[bt] (4) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fc3e95e76ba]
[bt] (5) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fc3e931d41d]

Aborted (core dumped)

@stonedl3
Author

stonedl3 commented Feb 19, 2018 via email

@stonedl3
Author

I have resolved the issue. I had used resize=256 while my training script was using 3x227x227 images. You can close the issue.
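
For anyone hitting the same error, one way to confirm a size mismatch is to scan the packed records and list any image smaller than the iterator's `data_shape`. This is a rough sketch, not a tested tool: the helper names `undersized` and `iter_rec_shapes` are hypothetical, and it assumes `mx.recordio.MXRecordIO` and `mx.recordio.unpack_img` behave as documented in the installed MXNet version (the latter needs OpenCV).

```python
def undersized(shapes, data_shape):
    """Yield (index, rows, cols) for images smaller than data_shape's H x W."""
    _, height, width = data_shape
    for index, (rows, cols) in enumerate(shapes):
        if rows < height or cols < width:
            yield index, rows, cols


def iter_rec_shapes(rec_path, limit=1000):
    """Yield (rows, cols) for up to `limit` images stored in a .rec file."""
    import mxnet as mx  # imported lazily so `undersized` has no dependency

    reader = mx.recordio.MXRecordIO(rec_path, "r")
    try:
        for _ in range(limit):
            item = reader.read()
            if item is None:  # end of file
                break
            _, img = mx.recordio.unpack_img(item)
            yield img.shape[0], img.shape[1]
    finally:
        reader.close()


# The pure helper can be tried without MXNet installed:
sample = [(256, 341), (200, 500)]
print(list(undersized(sample, (3, 227, 227))))
```

With MXNet available, `list(undersized(iter_rec_shapes("imagenet/rec/train.rec"), (3, 227, 227)))` would report the first offending records. An empty list for the scanned range suggests the packed images are large enough for that crop size.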
