PointPillars is a 3D object detection model based on LiDAR point clouds that balances detection accuracy and inference speed. The repository code is fairly old and the deployment ran into quite a few problems, so I am recording them here.
(Because of differences in GPU model, system architecture, and so on, these notes are not universally applicable; use your own judgment.)
Cloud platform environment: Ubuntu 18.04 (x64), GTX 1080 Ti (more than 10 GB VRAM), Python 3.7 (Anaconda recommended for environment management)

CUDA Installation

CUDA: version 11.1 verified working. Arguably most of the problems were related to the CUDA version (or to the combination of CUDA and package versions).
Issues seen with other versions:
CUDA 12.x: numba fails with an incompatible IR version.
CUDA 10.1: compiling SparseConvNet fails because 10.1 does not support C++17.
CUDA 11.7: training fails with /usr/include/boost/math/constants/constants.hpp:265:1007: error: template argument 2 is invalid plus a pile of other boost errors; this actually has nothing to do with boost (1.65), and switching to a lower CUDA version fixed it.
Other versions were not tested.

Download the runfile installer (for a local deployment, if the download is slow, try changing com to cn in the URL):

wget https://developer.download.nvidia.com/compute/cuda/11.1.1/local_installers/cuda_11.1.1_455.32.00_linux.run

After downloading, run the installer:

sh cuda_11.1.1_455.32.00_linux.run

Note: the platform's default GPU driver is 535+, which is backward compatible with CUDA, so installing CUDA does not require downgrading the driver. During installation you can let the installer automatically make the installed CUDA version the active one; the manual method is described below.
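To confirm the driver version and the highest CUDA version it supports (the CUDA Version field in the header of the output), run:

nvidia-smi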

Switching between multiple CUDA versions: the active CUDA version is determined by the /usr/local/cuda symlink, so just point the link at the directory of the desired version.

sudo rm -rf /usr/local/cuda  
sudo ln -s /usr/local/cuda-11.1 /usr/local/cuda

Set the environment variables: vim ~/.bashrc

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
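After editing, reload the configuration so the variables take effect in the current shell:

source ~/.bashrc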

If you switch versions frequently, consider writing these commands into a shell script and just running it (a sketch follows the verification step below). Verify the CUDA installation:

nvcc -V
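A minimal sketch of such a switch script (the script name and argument handling are my own; it assumes /usr/local/cuda-<version> is already installed):

#!/bin/bash
# switch-cuda.sh: point /usr/local/cuda at a given version, e.g. ./switch-cuda.sh 11.1
VERSION=${1:?usage: ./switch-cuda.sh <version>}
sudo rm -rf /usr/local/cuda
sudo ln -s /usr/local/cuda-${VERSION} /usr/local/cuda
nvcc -V   # confirm the active version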

PyTorch Installation

Version: 1.9.1

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

The most important issue: make sure the PyTorch version is compatible with your CUDA version, and confirm the GPU can actually be used. Installing older PyTorch releases (e.g. 1.9.1, 1.8.1) from the official site with conda install easily ends up with a CPU-only build; the newer conda instructions distinguish builds by the specified install channel, so check carefully. If installing PyTorch on a local laptop produces:

ProxyError: Conda cannot proceed due to an error in your proxy configuration. Check for typos and other configuration errors in any '.netrc' file in your home directory, any environment variables ending in '_PROXY', and any other system-wide proxy configuration settings.

this is caused by a proxy on the local network; close the proxy software and disable the local proxy. Verification:

python        # enter the interactive interpreter
import torch
print(torch.cuda.is_available())

Normally this returns True; at this point the base environment is ready.
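For a slightly fuller check (standard PyTorch calls; assumes at least one GPU is visible), print the build's CUDA version and the detected device in one line:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_device_name(0))"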

Deploy

Clone the repository into your workspace and build it there; the remaining package installation steps follow the repository README, reproduced below:


Welcome to PointPillars.

This repo demonstrates how to reproduce the results from PointPillars: Fast Encoders for Object Detection from Point Clouds (to be published at CVPR 2019) on the KITTI dataset by making the minimum required changes from the preexisting open source codebase SECOND.

This is not an official nuTonomy codebase, but it can be used to match the published PointPillars results.

WARNING: This code is not being actively maintained. This code can be used to reproduce the results in the first version of the paper, https://arxiv.org/abs/1812.05784v1. For an actively maintained repository that can also reproduce PointPillars results on nuScenes, we recommend using SECOND. We are not the owners of the repository, but we have worked with the author and endorse his code.

Example Results

Getting Started

This is a fork of SECOND for KITTI object detection and the relevant subset of the original README is reproduced here.

Code Support

ONLY supports python 3.6+, pytorch 0.4.1+. Code has only been tested on Ubuntu 16.04/18.04.

Install

1. Clone code

git clone https://github.com/nutonomy/second.pytorch.git

2. Install Python packages

It is recommended to use the Anaconda package manager.

First, use Anaconda to configure as many packages as possible.

conda create -n pointpillars python=3.7 anaconda
source activate pointpillars
conda install shapely pybind11 protobuf scikit-image numba pillow
#conda install pytorch torchvision -c pytorch
conda install google-sparsehash -c bioconda

Then use pip for the packages missing from Anaconda.

pip install --upgrade pip
pip install fire tensorboardX

Finally, install SparseConvNet. This is not required for PointPillars, but the general SECOND code base expects this to be correctly configured.

git clone git@github.com:facebookresearch/SparseConvNet.git
cd SparseConvNet/
bash build.sh
# NOTE: if bash build.sh fails, try bash develop.sh instead

Additionally, you may need to install Boost geometry:

sudo apt-get install libboost-all-dev

3. Setup cuda for numba

You need to add the following environment variables for numba to ~/.bashrc:

export NUMBAPRO_CUDA_DRIVER=/usr/lib/x86_64-linux-gnu/libcuda.so
export NUMBAPRO_NVVM=/usr/local/cuda/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=/usr/local/cuda/nvvm/libdevice
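To verify that numba itself can see the GPU (a quick check of my own, not part of the README):

python -c "from numba import cuda; cuda.detect()"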

4. PYTHONPATH

Add second.pytorch/ to your PYTHONPATH.

export PYTHONPATH=$PYTHONPATH:/your_second.pytorch_path/

Prepare dataset

1. Dataset preparation

Download KITTI dataset and create some directories first:

└── KITTI_DATASET_ROOT
    ├── training    <-- 7481 train data
    │   ├── image_2 <-- for visualization
    │   ├── calib
    │   ├── label_2
    │   ├── velodyne
    │   └── velodyne_reduced <-- empty directory
    └── testing     <-- 7518 test data
        ├── image_2 <-- for visualization
        ├── calib
        ├── velodyne
        └── velodyne_reduced <-- empty directory

Note: PointPillar's protos use KITTI_DATASET_ROOT=/data/sets/kitti_second/.
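A sketch for creating the empty skeleton with bash brace expansion (substitute your own KITTI_DATASET_ROOT):

export KITTI_DATASET_ROOT=/data/sets/kitti_second
mkdir -p $KITTI_DATASET_ROOT/training/{image_2,calib,label_2,velodyne,velodyne_reduced}
mkdir -p $KITTI_DATASET_ROOT/testing/{image_2,calib,velodyne,velodyne_reduced}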

2. Create kitti infos:

python create_data.py create_kitti_info_file --data_path=/home/eden-mo/expend2/Python_Code/PointPillars/DataSet/KITTI_DATASET_ROOT

3. Create reduced point cloud:

python create_data.py create_reduced_point_cloud --data_path=/home/eden-mo/expend2/Python_Code/PointPillars/DataSet/KITTI_DATASET_ROOT

4. Create groundtruth-database infos:

python create_data.py create_groundtruth_database --data_path=/home/eden-mo/expend2/Python_Code/PointPillars/DataSet/KITTI_DATASET_ROOT

5. Modify config file

The config file needs to be edited to point to the above datasets. Note: change the paths in the .proto files under /second.pytorch/second/configs/pointpillars/car/ to your own storage paths. Batch replacement in vim on the server (changing mydir to new dir):

:%s#/mydir/#/new dir/#g
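Alternatively, the same substitution can be done for all the proto files at once with sed (the glob below assumes you run it from the repository root):

sed -i 's#/mydir/#/new dir/#g' second/configs/pointpillars/car/*.proto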

train_input_reader: {
  ...
  database_sampler {
    database_info_path: "/path/to/kitti_dbinfos_train.pkl"
    ...
  }
  kitti_info_path: "/path/to/kitti_infos_train.pkl"
  kitti_root_path: "KITTI_DATASET_ROOT"
}
...
eval_input_reader: {
  ...
  kitti_info_path: "/path/to/kitti_infos_val.pkl"
  kitti_root_path: "KITTI_DATASET_ROOT"
}

Train

cd ~/second.pytorch/second
python ./pytorch/train.py train --config_path=/root/autodl-tmp/second.pytorch/second/configs/pointpillars/car/xyres_16.proto --model_dir=/root/autodl-tmp/model_car1
  • If you want to train a new model, make sure "/path/to/model_dir" doesn't exist.
  • If "/path/to/model_dir" does exist, training will be resumed from the last checkpoint.
  • Training only supports a single GPU.
  • Training uses a batchsize=2 which should fit in memory on most standard GPUs.
  • On a single 1080Ti, training xyres_16 requires approximately 20 hours for 160 epochs.

Before training, consider detaching the job into the background so that a network change or the computer going to sleep does not kill the run:

nohup python -u train.py > log.file 2>&1 &

The full command then becomes (update: --pickle_result=False was added to output the format expected by the PointRCNN visualization method):

nohup python -u ./pytorch/train.py train --config_path=Your dir --model_dir=/root/autodl-tmp/model_car3 --pickle_result=False > log1.file 2>&1 &

Parameter explanation:
nohup: short for "no hang up". It makes the command ignore SIGHUP (the hang-up signal), so the program keeps running after you log out of the terminal or the session ends.
-u: disables buffering on standard output, so output is shown in real time.
log.file: redirects the program's standard output into a file named "log.file".
2>&1: redirects standard error (stderr) to standard output, so error messages are also written to "log.file".
&: placed at the end of the command, it runs the command in the background, leaving the terminal free for other commands instead of waiting for this one to finish.

Watch the output file:

tail -f log.file

Check the running process (still works from a newly opened terminal):

ps ux

To stop the process early:

kill <PID>
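For example (the PID below is illustrative):

ps ux | grep train.py   # find the PID of the training process
kill 12345              # replace 12345 with the actual PID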

Evaluate

cd ~/second.pytorch/second/
python pytorch/train.py evaluate --config_path=configs/pointpillars/car/xyres_16.proto --model_dir=/path/to/model_dir
  • Detection results will be saved in model_dir/eval_results/step_xxx.
  • By default, results are stored as a result.pkl file. To save as official KITTI label format use --pickle_result=False.

Q&A

1. Package compatibility problems
For example llvmlite, numpy, and numba: llvmlite depends on the LLVM version, numba depends on llvmlite, and numba in turn depends on numpy. Unfortunately, only llvmlite publishes a compatibility reference against LLVM versions:

https://pypi.org/project/llvmlite/

Tools for checking the LLVM version:

sudo apt-get install llvm-dev   # install the LLVM tooling
llvm-config --version           # show the LLVM version
llvm-config --prefix            # show the LLVM install location

My LLVM is 6.0.0 with llvmlite 0.34.0, which seems to cause no problems, so I focused on the compatibility of the latter two packages. One thing that is certain: numpy should be 1.17.2. Too high (e.g. 1.19) and training hits a data-type error after about an hour; versions that are too low reportedly do not work either. numba 0.56 is too new for this numpy, so I chose 0.51.0. Other combinations can only be verified one pair at a time.
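The combination that worked for me, pinned explicitly (numba 0.51.x pairs with llvmlite 0.34.x; adjust to your own setup):

pip install numpy==1.17.2 llvmlite==0.34.0 numba==0.51.0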

2. No Module named "xxxx"
Both create_data.py and train.py threw module errors for me at some point. The fix is to edit the offending Python file and add the following before the failing import:

import sys
sys.path.append("/path/to/second.pytorch")   # your second.pytorch path
3. segmentation fault (core dumped)
I have not found the root cause, but it is most likely due to incompatibility between CUDA, PyTorch, and the packages, not a memory bug in the code.
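To at least obtain a Python-level traceback at the crash site (a standard-library technique, not specific to this repo), run training with faulthandler enabled:

python -X faulthandler ./pytorch/train.py train --config_path=Your dir --model_dir=Your dir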

4. Work needed for retraining
If you need to retrain, or have changed the paths of the dataset or the cloned repository:
1. Update the paths in the .proto files under /second.pytorch/second/configs/pointpillars/car/.
2. Update the sys.path.append() path in train.py; otherwise the torchplus module will not be found.
3. Re-enter /second.pytorch/second/ConvNet/ and run the develop.sh script to recompile; otherwise the Sparse-related modules will not be found.