Formulate the task of pushing-for-grasping as a Markov decision process:
a given state \(s_t\)
an action \(a_t\)
a policy \(\pi(s_t)\)
a new state \(s_{t+1}\)
an immediate corresponding reward \(R_{a_t}(s_t,s_{t+1})\)
The learning objective is to iteratively minimize the temporal difference error \(\delta_t\) of \(Q_\pi(s_t,a_t)\) to a fixed target value \(y_t\):
\[ \delta_t = |Q(s_t,a_t) - y_t| \]
\[ y_t = R_{a_t}(s_t,s_{t+1}) + \gamma\, Q\big(s_{t+1}, \operatorname{argmax}_{a'} Q(s_{t+1},a')\big) \]
where \(a'\) ranges over the set of all available actions
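A minimal sketch of how this target and TD error could be computed with PyTorch; `q_net`, the flattened action indexing, and the discount value are illustrative assumptions, not the paper's exact implementation:

```python
import torch

# Sketch (not the paper's code) of the TD target and error for one transition.
# `q_net` maps a state tensor to a dense map of Q values; gamma is an assumed discount.
def td_target(q_net, reward, next_state, gamma=0.5):
    with torch.no_grad():
        q_next = q_net(next_state).flatten()      # Q(s_{t+1}, a') for every a'
        best = q_next.argmax()                    # argmax_{a'} Q(s_{t+1}, a')
        return reward + gamma * q_next[best]      # y_t

def td_error(q_net, state, action_idx, reward, next_state, gamma=0.5):
    y = td_target(q_net, reward, next_state, gamma)
    q_sa = q_net(state).flatten()[action_idx]     # Q(s_t, a_t)
    return (q_sa - y).abs()                       # delta_t
```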
model each state \(s_t\) as an RGB-D heightmap image
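As a rough illustration of how such a heightmap might be built, the sketch below bins a workspace-frame RGB-D point cloud into a top-down grid, keeping the tallest point per cell; the workspace bounds and resolution are assumed values:

```python
import numpy as np

# Illustrative only: project a workspace-frame point cloud into an RGB-D heightmap.
# `points` is (N, 3) xyz in meters, `colors` is (N, 3) RGB; bounds/resolution are assumptions.
def make_heightmap(points, colors, bounds=((-0.224, 0.224), (-0.224, 0.224)), resolution=0.002):
    (x0, x1), (y0, y1) = bounds
    w = int(round((x1 - x0) / resolution))
    h = int(round((y1 - y0) / resolution))
    height = np.zeros((h, w), dtype=np.float32)
    color = np.zeros((h, w, 3), dtype=np.uint8)
    # keep only points inside the workspace
    mask = (points[:, 0] >= x0) & (points[:, 0] < x1) & \
           (points[:, 1] >= y0) & (points[:, 1] < y1)
    for (x, y, z), c in zip(points[mask], colors[mask]):
        col = int((x - x0) / resolution)
        row = int((y - y0) / resolution)
        if z > height[row, col]:        # keep the tallest point per cell
            height[row, col] = z
            color[row, col] = c
    return color, height                # RGB channels + height-from-above channel
```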
Parameterize each action \(a_t\) as a motion primitive behavior \(\psi\) executed at the 3D location \(q\) projected from a pixel \(p\) of the heightmap image representation of the state \(s_t\):
\[ a = (\psi, q) \mid \psi \in \{\mathrm{push}, \mathrm{grasp}\},\; q \to p \in s_t \]
The motion primitive behaviors are defined as follows (a decoding sketch follows the list):
Pushing: \(q\) is the starting position of a 10 cm push in one of \(k = 16\) directions
Grasping: \(q\) is the middle position of a top-down parallel-jaw grasp in one of \(k = 16\) orientations
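A hypothetical decoder from (primitive, rotation index, pixel) to a 3D action, reusing the assumed workspace bounds and resolution from the heightmap sketch above; all names here are illustrative:

```python
import numpy as np

# Hypothetical decoder: map (primitive, rotation index, pixel) to a 3D action.
# 16 rotations discretize 360 degrees; bounds/resolution match the heightmap sketch.
def decode_action(primitive, rot_idx, pixel, height,
                  bounds=((-0.224, 0.224), (-0.224, 0.224)),
                  resolution=0.002, num_rotations=16):
    assert primitive in ("push", "grasp")
    row, col = pixel
    x = bounds[0][0] + col * resolution        # pixel column -> workspace x
    y = bounds[1][0] + row * resolution        # pixel row -> workspace y
    z = height[row, col]                       # surface height at that pixel
    angle = rot_idx * (2 * np.pi / num_rotations)
    # push:  (x, y, z) is the start of a 10 cm push along `angle`
    # grasp: (x, y, z) is the midpoint of a top-down parallel-jaw grasp at `angle`
    return {"primitive": primitive, "position": (x, y, z), "angle": angle}
```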
extend vanilla deep Q-networks (DQN) by modeling the Q-function as two feed-forward fully convolutional networks (FCNs) \(\Phi_p\) and \(\Phi_g\), one per motion primitive
input: the heightmap image representation of the state \(s_t\)
output: a dense pixel-wise map of Q values with the same image size and resolution as \(s_t\)
Both FCNs \(\Phi_p\) and \(\Phi_g\) share the same network architecture: two parallel 121-layer DenseNet towers pre-trained on ImageNet, followed by channel-wise concatenation and 2 additional 1 × 1 convolutional layers interleaved with nonlinear activation functions (ReLU) and spatial batch normalization, then bilinearly upsampled.
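A condensed PyTorch sketch of one such FCN tower, assuming torchvision's DenseNet-121 and tiling the depth channel to 3 channels so the depth trunk can reuse ImageNet weights; the 1 × 1 head channel sizes are assumptions:

```python
import torch
import torch.nn as nn
import torchvision

# Sketch of one Q-function FCN (e.g. the grasping net Phi_g), assuming torchvision.
class QMapFCN(nn.Module):
    def __init__(self):
        super().__init__()
        # two parallel DenseNet-121 feature trunks, ImageNet pre-trained:
        # one for the RGB channels, one for the depth channel (tiled to 3 channels)
        self.color_trunk = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.depth_trunk = torchvision.models.densenet121(weights="IMAGENET1K_V1").features
        self.head = nn.Sequential(  # 1x1 convs with BN + ReLU after concatenation
            nn.Conv2d(2048, 64, kernel_size=1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, color, depth):
        # channel-wise concatenation of the two trunks (1024 + 1024 = 2048 channels)
        feat = torch.cat([self.color_trunk(color), self.depth_trunk(depth)], dim=1)
        q = self.head(feat)
        # bilinearly upsample back to the input heightmap resolution
        return nn.functional.interpolate(q, size=color.shape[-2:],
                                         mode="bilinear", align_corners=False)
```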
\(R_g(s_t,s_{t+1}) = 1\) if the grasp is successful
\(R_p(s_t,s_{t+1}) = 0.5\) if the push makes detectable changes to the environment, i.e., if the sum of differences between the heightmaps of \(s_t\) and \(s_{t+1}\) exceeds some threshold
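A minimal sketch of this reward rule; the change threshold value here is an assumption, not the paper's:

```python
import numpy as np

# Illustrative reward assignment; the change `threshold` value is assumed.
def compute_reward(primitive, grasp_succeeded, depth_before, depth_after, threshold=0.1):
    if primitive == "grasp":
        return 1.0 if grasp_succeeded else 0.0
    # push: reward 0.5 only if the scene changed detectably, i.e. the
    # summed heightmap difference exceeds the threshold
    change = np.abs(depth_after - depth_before).sum()
    return 0.5 if change > threshold else 0.0
```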
Our Q-learning FCNs are trained at each iteration \(i\) using the Huber loss function:
\[ \mathcal{L}_i = \begin{cases} \tfrac{1}{2}\big(Q(s_i,a_i) - y_i\big)^2, & \text{if } |Q(s_i,a_i) - y_i| < 1 \\ |Q(s_i,a_i) - y_i| - \tfrac{1}{2}, & \text{otherwise} \end{cases} \]
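In PyTorch this is exactly the built-in smooth L1 loss, applied here as a sketch:

```python
import torch.nn.functional as F

# Huber (smooth L1) loss between the predicted Q(s_i, a_i) and the fixed target y_i;
# in practice gradients flow only through the pixel/rotation of the executed action.
def q_loss(q_pred, y_target):
    return F.smooth_l1_loss(q_pred, y_target.detach())
```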