Some tips about using Google’s TPU (Cont.)

Robin Dong 2018-09-28 16:23

Sometimes I get this error from TPUEstimator:

...
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DeadlineExceededError: Deadline Exceeded

After stopping and restarting the TPU in the GCP console, the error disappeared. A TPU can’t be used directly the way a GPU can: you won’t find a device node in the VM like ‘/dev/tpu’ or anything similar. Google provides the TPU as an RPC service, so you can only run DNN training through that service. I suspect this RPC service is not stable enough yet, so it occasionally fails and leads to the ‘Deadline Exceeded’ error.
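Since the TPU is reached over gRPC, a transient ‘Deadline Exceeded’ can sometimes be worked around by simply retrying before resorting to a restart. Below is a minimal sketch of a generic retry wrapper; the exception class here is a local stand-in (in real code you would catch `tensorflow.python.framework.errors_impl.DeadlineExceededError`), and the flaky function is just a simulation:

```python
import time

class DeadlineExceededError(Exception):
    """Stand-in for tensorflow.python.framework.errors_impl.DeadlineExceededError."""

def retry_on_deadline(fn, attempts=3, backoff_sec=5.0):
    """Call fn(); retry up to `attempts` times on DeadlineExceededError."""
    for i in range(attempts):
        try:
            return fn()
        except DeadlineExceededError:
            if i == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(backoff_sec * (i + 1))  # linear backoff between retries

# Simulated flaky call: fails twice with the deadline error, then succeeds.
calls = {"n": 0}
def flaky_train_step():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlineExceededError("Deadline Exceeded")
    return "ok"

result = retry_on_deadline(flaky_train_step, attempts=3, backoff_sec=0.0)
print(result)  # -> ok
```

If the error persists across retries, the RPC endpoint itself is likely stuck and restarting the TPU is the next step.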

Sometimes I also get this type of error from the TPU:

2018-09-29 01:57:12.779430: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.

The only solution I have found is to delete the old TPU instance and create a new one in the GCP console. It seems Google needs to improve the robustness of its TPU RPC service.
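The delete-and-recreate cycle can also be done from the command line with `gcloud` instead of clicking through the console. A sketch, where the TPU name, zone, and TensorFlow version are placeholders to replace with your own values:

```shell
# Delete the stuck TPU instance (name and zone are placeholders).
gcloud compute tpus delete my-tpu --zone=us-central1-b --quiet

# Recreate it with the same accelerator type and TF version.
gcloud compute tpus create my-tpu \
    --zone=us-central1-b \
    --accelerator-type=v2-8 \
    --version=1.11 \
    --network=default
```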

Running 10000 steps per round, I get the following ‘loss’ for each round:

INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 2.3713436.
INFO:tensorflow:Loss for final step: 2.9957836.
INFO:tensorflow:Loss for final step: 1.3974692.
INFO:tensorflow:Loss for final step: 1.3933656.
INFO:tensorflow:Loss for final step: 2.3544135.
INFO:tensorflow:Loss for final step: 1.9383199.
INFO:tensorflow:Loss for final step: 2.0213509.
INFO:tensorflow:Loss for final step: 1.8641331.
INFO:tensorflow:Loss for final step: 1.6767861.
INFO:tensorflow:Loss for final step: 2.63849.
INFO:tensorflow:Loss for final step: 2.19468.
INFO:tensorflow:Loss for final step: 1.9854712.
INFO:tensorflow:Loss for final step: 1.9380764.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.089243.
INFO:tensorflow:Loss for final step: 2.1150723.
INFO:tensorflow:Loss for final step: 1.8242038.
INFO:tensorflow:Loss for final step: 2.8426473.

It’s quite strange that the ‘loss’ doesn’t go low enough. I still need to run more experiments.
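A quick sanity check over the logged final-step losses backs this up: the overall mean sits around 2, and the second half of the runs is barely lower than the first, so there is no clear downward trend.

```python
# Final-step losses copied from the training logs above.
losses = [3.2015076, 2.5733204, 1.8888541, 2.3713436, 2.9957836,
          1.3974692, 1.3933656, 2.3544135, 1.9383199, 2.0213509,
          1.8641331, 1.6767861, 2.63849, 2.19468, 1.9854712,
          1.9380764, 0.97299415, 2.089243, 2.1150723, 1.8242038,
          2.8426473]

mean = sum(losses) / len(losses)
first_half = sum(losses[:10]) / 10.0    # earlier rounds
second_half = sum(losses[10:]) / 11.0   # later rounds
print(round(mean, 2), round(first_half, 2), round(second_half, 2))
```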

Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process 100 samples per second. Using a TPUv2-8 (8 TPUv2 cores), it can process about 500 samples per second. At first I was disappointed by the performance boost of TPUv2, since that works out to only about 1.4 TFLOPS per core. But then I realized the bottleneck may not be the TPU itself, since IO is usually the limit on training speed. Besides, my model is MobileNet_v2, which is so simple and light that it can’t exploit the full capability of the TPU.
Therefore I set ‘depth_multiplier=4’ for MobileNet_v2. With this larger model, the GTX 960 could process 21 samples per second, while the TPUv2-8 could process 275 samples per second. From this, I can estimate that each TPUv2 core delivers about 4 TFLOPS. I know this figure seems far below Google’s official 45 TFLOPS, but considering the possible bottlenecks of storage IO and network bandwidth, it becomes understandable. There is also another possibility: Google’s 45 TFLOPS may refer to half-precision performance.
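The arithmetic behind that estimate can be reproduced directly. This sketch assumes the GTX 960’s published single-precision peak of roughly 2.3 TFLOPS and that throughput scales linearly with compute (both simplifying assumptions):

```python
# Measured throughput (samples/sec) with depth_multiplier=4.
gtx960_sps = 21.0
tpu_v2_8_sps = 275.0

# Assumed GTX 960 peak FP32 throughput in TFLOPS (published spec, approximate).
gtx960_tflops = 2.3

speedup = tpu_v2_8_sps / gtx960_sps        # ~13.1x faster than the GPU
total_tflops = speedup * gtx960_tflops     # effective TFLOPS across all 8 cores
per_core_tflops = total_tflops / 8         # roughly "4 TFLOPS" per TPUv2 core
print(round(per_core_tflops, 1))           # -> 3.8
```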
