Enhancing E-commerce Recommender System Adaptability with Online Deep Controllable Learning-To-Rank

Anxiang Zeng 1,2*, Han Yu 1, Hualin He 3*, Yabo Ni 3, Yongliang Li 3, Jingren Zhou 3 and Chunyan Miao 1,2*
1 School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore
2 Alibaba-NTU Singapore Joint Research Institute
3 Alibaba Group, Hangzhou, China
Abstract
In the past decade, recommender systems for e-commerce have witnessed significant advancement. Recently, the focus of research has shifted from single-objective optimization to multi-objective optimization in the face of changing business requirements. For instance, the add-to-cart rate is the optimization target prior to a promotional campaign, while conversion rates should be kept from declining. During the campaign, the target changes to maximizing transactions only. Immediately after the campaign, maximizing click-through rates is required while transactions should be kept at the daily level. Dynamically adapting among these short-term and rapidly changing objectives is an important but difficult problem, because the optimization objectives potentially conflict with each other. In this paper, we report our experience designing and deploying the online Deep Controllable Learning-To-Rank (DC-LTR) recommender system to address this challenge. It enhances the feedback controller in LTR with multi-objective optimization so as to maximize different objectives under constraints. Its ability to dynamically adapt to changing business objectives has resulted in significant business advantages. Since September 2019, DC-LTR has become a core service enabling adaptive online training and real-time deployment of ranking models for changing business objectives in AliExpress and Lazada. Under both everyday use scenarios and peak load scenarios during large promotional campaigns, DC-LTR has achieved significant improvements in adaptively satisfying real-world business objectives.
Introduction
As e-commerce platforms grow larger in scale, artificial intelligence (AI) techniques (e.g., agent and reputation modelling (Pan et al. 2009; Yu et al. 2010; Shen et al. 2011)) are increasingly being applied, and personalized recommendation is playing an important role. On the one hand, improving recommendation accuracy and personalization has been an active area of research, in which the wide and deep model proposed by Google (Cheng et al. 2016), the Deep Interest Network (Zhou et al. 2018) and the entire space multi-task models (ESSM and ESSM2) (Ma et al. 2018) proposed by Alibaba have been widely adopted.
On the other hand, recommender systems face different business objectives in different scenarios and stages of
recommendation (Kunaver and Pozrl 2017). Firstly, recommendation scenarios can be divided into different types (e.g., pre-, during- and post-purchase, campaign, promotion, bundle) with different objectives for different user groups or different businesses. Secondly, product recommendation during online promotional campaigns with high traffic volumes often faces frequently changing business objectives. Moreover, the log data and feature distributions of the recommendation system are very different from those generated under normal usage. Adapting to such changes quickly is key to the success of promotional campaigns. Existing heavy models, such as the click-through rate (CTR) and conversion rate (CVR) prediction models (e.g., (Cheng et al. 2016)) used in the rough ranking and full ranking phases, with single-objective optimization and daily build-and-deploy cycles, could not address this business challenge.
Attempts have been made to dynamically adapt among these potentially conflicting optimization objectives. The entire space multi-task models, ESSM and ESSM2, are optimized for both the click-loss and the pay-loss over the entire space by means of multi-task learning (Ma et al. 2018; Wen et al. 2020). However, the sub-models of ESSM were optimized separately with their own losses, meaning that the optimization process was still a collection of single-objective optimizations. A gradient normalization algorithm (GradNorm) for adaptive loss balancing was proposed in 2018 to model the uncertainty in deep multi-task networks (Chen et al. 2018). Though it improved the optimization process for unconstrained multi-objective problems, it lacked the ability to handle multi-objective problems with constraints. Reinforcement learning has also been tried to trade off the CTR and CVR targets with online feedback as the reward (Hu et al. 2018), but extensive online exploration may cause too much uncertainty and consume too many online resources. Recently, in 2019, a multi-gradient descent algorithm for multi-objective recommender systems (MGDRec) was proposed to achieve Pareto optimization in recommendations (Milojkovic et al. 2019; Ning and Karypis 2010). Another multi-objective, model-free Pareto-efficient framework for learning to rank (PE-LTR) (Lin et al. 2019; Ribeiro et al. 2014) has also achieved remarkable online performance. Unfortunately, these Pareto-based algorithms require strict Pareto or Pareto-efficient conditions during optimization, making the optimized final state uncontrollable. Thus, they were not suitable for applications
which need to meet specific business requirements.
In this paper, we report our experience designing and deploying the online Deep Controllable Learning-To-Rank (DC-LTR) recommender system to address this challenge. It enhances the feedback controller in LTR with multi-objective optimization so as to maximize different objectives under constraints. Its ability to dynamically adapt to changing business objectives has resulted in significant business advantages. Since its deployment in AliExpress (https://www.aliexpress.com/) and Lazada (https://www.lazada.com/) in September 2019, DC-LTR has become a core service, enabling adaptive online training and real-time deployment of ranking models based on changing business objectives. Under both everyday use scenarios and peak load scenarios during large promotional campaigns, DC-LTR has achieved significant improvements in adaptively satisfying real-world business objectives.
Application Description
The general framework of the recommendation system adopted by Alibaba is presented in Figure 1. When a user initiated a request, the customer behavior history data was used to select hundreds of thousands of related items from billions of candidates via multiple channels in the matching/retrieval phase. Then, a vector-based rough ranking model filtered out the top items in each channel, returning thousands of candidates. After that, the CTR and CVR prediction models ranked items separately in the full ranking phase. No items were filtered out during this stage because neither CTR nor CVR could be the only ranking metric. Finally, a small non-linear Learning-to-Rank (LTR) model (Joachims et al. 2007) took the ranking outputs of the CTR and CVR models and the online user and item features as inputs to determine the final dozens of items to be returned to the user. After slight adjustments by scattering and re-ranking, the resulting item list was returned to the user's applications (e.g., mobile apps, web browsers, etc.). The LTR phase played the most direct role in meeting business requirements in the entire recommendation process (Zeng et al. 2020). As high-click items may not always result in high transactions, trade-offs between CTR and CVR were often required.

Figure 1: A general framework for a recommender system showing the matching phase and the ranking phase. Note that items were filtered by the LTR model, instead of by the CTR and CVR models in the full ranking phase.

When applied to the recommendation scenarios in Alibaba, the LTR approach goes through the following steps, as illustrated in Figure 2.
1. Real-time Log Analysis: Analysis, de-duplication, and monitoring of original user logs such as exposures, clicks, add-to-cart events, collection (wish-list) events, and transactions were included in this phase.
2. Real-time Sample Generation: Correlation between features and label events such as clicks and transactions was established. Log delay and misalignment among different logs were fixed by stream join techniques such as the retraction of long-lagging pay events. Note that all click and purchase events happen after the corresponding item exposure event. Generally, most clicks happen several seconds after the exposures, while more than 40% of purchase events happen 24 hours after the clicks and the item exposures. Thus, the retraction technique, which retracts a negative sample and reissues a corresponding positive sample, was applied to obtain the pay labels (a minimal sketch is given after this list). In other words, during online training, for any positive sample, there will be a corresponding negative sample preceding it. Finally, different from offline pre-processing of samples, an online stream buffer pool technique was applied to achieve dynamic balancing and repeated sampling of the samples.
3. Real-time Stream Model Training: Network structure design, model training and verification, monitoring of the area under the curve (AUC) and other model indicators, and online-offline consistency verification were performed during the model training phase. Online AUC values for important features were calculated so as to monitor distribution changes in real time. Online AUC feedback for the model outputs was also collected to indicate the current status of the scoring model. The scoring networks for online LTR were designed to be small with fast convergence, making real-time stream training and real-time deployment feasible.
4. Deployment: Model network structures were deployed by daily building, while the model weights of each layer were deployed online every 5 minutes, immediately after they were updated. Previous model scores, such as the CTR and CVR model scores, the rough ranking score and the match score from the matching phase, were collected as inputs for the LTR model, and the final score for each candidate item was obtained together with other basic item and user features. Top-ranking items were then selected to be presented in customer applications.
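To make the retraction technique in step 2 concrete, the following minimal Python sketch (our own illustration; the event fields and helper names are hypothetical and not the production stream-join code) emits a negative sample for every exposure and, when a delayed pay event arrives, retracts that negative sample and reissues a positive one:

# Hypothetical sketch of the retraction technique from step 2: every exposure
# first yields a negative sample; when a (possibly hours-late) pay event joins
# the stream, the earlier negative sample is retracted and a positive sample
# with the same features is reissued.
from collections import namedtuple

Sample = namedtuple("Sample", ["features", "label", "weight"])

def join_with_retraction(event):
    """Map a joined log event to a list of training samples.

    `event` is assumed to carry `features`, `type` ("exposure" or "pay")
    and, for pay events, the features of the original exposure.
    """
    if event["type"] == "exposure":
        # Emit the pessimistic negative sample immediately.
        return [Sample(event["features"], label=0, weight=+1.0)]
    if event["type"] == "pay":
        # Retract the earlier negative (weight -1) and reissue it as positive.
        return [Sample(event["features"], label=0, weight=-1.0),
                Sample(event["features"], label=1, weight=+1.0)]
    return []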
Apart from the real-time stream process, the logs and samples from the real-time stream were also collected by the offline data processing system, where more complicated and complete analyses were performed.
The DC-LTR algorithm in our AI Engine covers the above four processes, of which the second step (sample generation) and the third step (model training) are the most important design points.
Figure 2: LTR training and online deployment flowchart
Use of AI Technology
In this section, we describe the proposed Deep Controllable Learning-To-Rank (DC-LTR) approach in our AI engine. The proposed approach is composed of a scoring network, a bias network and a tuning network based on feedback control, as illustrated in Figure 3.

The scoring network for online DC-LTR is designed to be lightweight with fast convergence so that it can learn the changing distribution and be deployed online in real time. A network with no more than 100 feature inputs, no more than 1 MB of embedding weights for sparse categorical features, and 3 hidden fully connected layers of sizes [128, 32, 16] has been adopted. Batch Normalization is performed in the input layer and every hidden layer before the Leaky ReLU activation function to speed up model convergence.
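A minimal Keras-style sketch of such a scoring network is given below. It is our own illustration of the stated design (three hidden layers of sizes [128, 32, 16], Batch Normalization before each Leaky ReLU); the feature dimensions, vocabulary size and embedding size are placeholder assumptions rather than the production values:

# Sketch of the light-weight scoring network described above.
import tensorflow as tf

def build_scoring_network(num_dense_features=80, cat_vocab_size=10000, emb_dim=8):
    dense_in = tf.keras.Input(shape=(num_dense_features,), name="dense_features")
    cat_in = tf.keras.Input(shape=(1,), dtype="int32", name="categorical_feature")

    # Small sparse-categorical embedding (kept well under ~1 MB of weights).
    emb = tf.keras.layers.Embedding(cat_vocab_size, emb_dim)(cat_in)
    emb = tf.keras.layers.Flatten()(emb)

    x = tf.keras.layers.Concatenate()([dense_in, emb])
    x = tf.keras.layers.BatchNormalization()(x)          # BN in the input layer
    x = tf.keras.layers.LeakyReLU()(x)
    for units in [128, 32, 16]:                          # three hidden layers
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.BatchNormalization()(x)      # BN before the activation
        x = tf.keras.layers.LeakyReLU()(x)
    y_hat = tf.keras.layers.Dense(1, activation="sigmoid", name="score")(x)
    return tf.keras.Model([dense_in, cat_in], y_hat)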
The bias network includes a user bias and a loss bias. The user bias network, with the same structure as the scoring network, is applied during the training phase but removed when scoring online. The bias among different users is learned during training, and the item ranking order for each user is still retained after the user bias network is removed. Moreover, the loss bias network was designed to decouple, i.e. reduce, the influence between the different multi-objective losses. It is likewise applied during the training phase and removed when deployed; similarly, no ranking order is contaminated by the loss bias. With such adjustments, a significant improvement in convergence speed can be achieved, making real-time stream training and real-time deployment feasible.
The tuning network is a feedback controller implemented using TensorFlow (Abadi et al. 2016). Feedback from the scoring network is collected and the constrained business objectives are set, so that the constrained multi-objective optimization process can be tuned automatically by the controller.
Constrained Multi-Objective Function
A constrained multi-objective function was incorporated into DC-LTR to describe different business requirements using the area under the curve (AUC) (Fawcett 2006) and the Pearson correlation (CoRR) (Benesty et al. 2009) metrics.
In recommender systems, AUC is a widely used metric to measure model performance during training. In most cases, the relationship between AUC, CTR and the Sigmoid cross entropy loss could be expressed by

$$\text{entropy\_loss} \downarrow \;\sim\; \mathrm{AUC} \uparrow \;\sim\; \mathrm{CTR} \uparrow \tag{1}$$
By reducing the Sigmoid cross entropy loss, we could increase the AUC and thus the online performance of the recommender system. The entropy loss (denoted by $\ell_1$) could be expressed as:

$$\ell_1 = -\sum_x \left[ y^{l}(x) \log \hat{y}(x) + \left(1 - y^{l}(x)\right) \log\left(1 - \hat{y}(x)\right) \right] \tag{2}$$

where $y^{l} \in \{0, 1\}$, $l \in \{\text{click}, \text{wish}, \text{cart}, \text{pay}\}$ are the labels for the different events (click, wish, cart and pay), and $\hat{y} \in (0, 1)$ is the output of the scoring network.
The AUC metric could be obtained by (Zhou et al. 2018)

$$\mathrm{AUC} = \sum_{y^{l}(x)=1} \frac{\mathrm{rank}(\hat{y})}{N M} - \frac{1 + M}{2N} \tag{3}$$

where $M$ is the number of positive samples and $N$ is the number of negative samples. $\mathrm{rank}(\hat{y}) \in [1, M + N]$ denotes the rank of a sample $x$ when ordered by the model score output $\hat{y}$.
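The rank-based formula (3) can be checked with a few lines of Python; this is an illustrative sketch only (using scipy.stats.rankdata to handle tied scores):

# Small NumPy check of the rank-based AUC formula in Eq. (3).
import numpy as np
from scipy.stats import rankdata

def rank_auc(y_true, y_score):
    """AUC = sum of positive-sample ranks / (N*M) - (1 + M) / (2*N)."""
    ranks = rankdata(y_score)          # ranks in [1, M + N]
    m = np.sum(y_true == 1)            # number of positive samples
    n = np.sum(y_true == 0)            # number of negative samples
    return ranks[y_true == 1].sum() / (n * m) - (1 + m) / (2 * n)

y = np.array([0, 0, 1, 0, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(rank_auc(y, s))                  # equals the pairwise (Mann-Whitney) AUC, here 4/6 ≈ 0.667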
Figure 3: The system architecture of the AI Engine (the DC-LTR approach). The yellow components are the trainable parts of the whole framework. The gray and white components are un-trainable but tunable. The yellow arrows indicate the forward process and the black arrows represent the backward process.
Suppose that $\hat{y}$ and $y^{gmv}$ are normalized. The relationship between CoRR and the square loss could be expressed as:

$$\text{square\_loss} \downarrow \;\sim\; \mathrm{CoRR} \uparrow \tag{4}$$

The square loss (denoted by $\ell_2$) is:

$$\ell_2 = \sum_x \left( BN(y^{l}(x)) - BN(\hat{y}(x)) \right)^2 \tag{5}$$

where $BN$ denotes the normalization operation.
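Relation (4) can be made explicit with a short derivation of our own (not given in the paper): for BN-normalized (zero-mean, unit-variance) $\hat{y}$ and $y$ over $N$ samples,

$$\sum_x \big(BN(y(x)) - BN(\hat{y}(x))\big)^2 = \sum_x BN(y)^2 + \sum_x BN(\hat{y})^2 - 2\sum_x BN(y)\,BN(\hat{y}) = 2N\big(1 - \mathrm{CoRR}(\hat{y}, y)\big),$$

so minimizing the square loss $\ell_2$ is equivalent to maximizing CoRR.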
For daily-use recommendation scenarios, the aim was to maximize CTR while maintaining a high CVR. The optimization problem could be described as:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{click}), \quad \text{s.t. } c_1: \mathrm{AUC}(\hat{y}, y^{pay}) > r_0 \tag{6}$$

where $r_0$ is the reference base pay-AUC. The online CVR could be kept high by ensuring $\mathrm{AUC}(\hat{y}, y^{pay}) > r_0$.
In contrast, during promotional campaigns such as the Singles' Day Festival (on November 11th), the business objective becomes maximizing CVR while maintaining a high CTR:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{pay}), \quad \text{s.t. } c_1: \mathrm{AUC}(\hat{y}, y^{click}) > r_1 \tag{7}$$
To avoid a reduction in transactions, i.e. in Gross Merchandise Volume (GMV), the above objective function must be achieved while keeping the correlation between the sample GMV and the model outputs positive. Thus, it becomes:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{click}), \quad \text{s.t. } \; c_1: \mathrm{AUC}(\hat{y}, y^{pay}) > r_0, \;\; c_2: \mathrm{CoRR}(\hat{y}, y^{gmv}) > r_2 \tag{8}$$

Specifically, $y^{gmv}$ is the normalized GMV of each sample.
Unconstrained Multi-Loss Decoupling
To solve the optimization problem with constraints, we first
consider an unconstrained case. By optimizing the AUC
objective with Sigmoid cross entropy loss, and optimizing
the CoRR objective with square loss, a multi-objective loss
function could then be formulated as a weighted sum of all
partial loss functions:
$$\begin{aligned} \text{Loss} ={} & \text{entropy\_loss}(\hat{y}, y^{click}) \times w_{click} \;+ \\ & \text{entropy\_loss}(\hat{y}, y^{pay}) \times w_{pay} \;+ \\ & \text{square\_loss}(\hat{y}, y^{gmv}) \times w_{gmv} \;+ \;\cdots \end{aligned} \tag{9}$$

where entropy_loss and square_loss are as described before. Determining the weights $w_{click}, w_{pay}, w_{gmv}$ of each loss function and decoupling the influence between the loss functions were important problems.
Generally speaking, optimization of the click-loss would eventually make the model output converge around the average CTR ($\approx 0.04$, Sigmoid activated), and optimization of
the pay-loss would eventually make the model output converge around the average CVR ($\approx 0.0001$, Sigmoid activated). The summation of the click-loss and the pay-loss would confuse the model outputs, resulting in loss coupling. In our approach, a trainable loss bias was introduced to decouple the influence between the different loss functions:
$$\begin{aligned} \text{Loss} ={} & \text{entropy\_loss}(\hat{y} \times \theta_{a_{click}}^{2} + \theta_{b_{click}},\, y^{click}) \times w_{click} \;+ \\ & \text{entropy\_loss}(\hat{y} \times \theta_{a_{pay}}^{2} + \theta_{b_{pay}},\, y^{pay}) \times w_{pay} \;+ \\ & \text{square\_loss}(\hat{y} \times \theta_{a_{gmv}}^{2} + \theta_{b_{gmv}},\, y^{gmv}) \times w_{gmv} \;+ \;\cdots \end{aligned} \tag{10}$$

where $\theta_{a_{click}}, \theta_{b_{click}}, \theta_{a_{pay}}, \theta_{b_{pay}}, \theta_{a_{gmv}}$ and $\theta_{b_{gmv}}$ are trainable weights, one pair for each loss, and $w_{click}, w_{pay}, w_{gmv}$ are un-trainable parameters to be tuned. By introducing the trainable loss bias, the losses are optimized around their respective mean spaces and the coupling is reduced.
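A hedged TensorFlow sketch of the decoupled loss in Eq. (10) is shown below. The variable names and the clipping safeguard are our own; the GMV labels are assumed to be already normalized as in Eq. (5), and the weights w_* are the non-trainable values later set by the feedback controller:

# Illustrative implementation of the decoupled multi-objective loss, Eq. (10).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
mse = tf.keras.losses.MeanSquaredError()

class DecoupledLoss(tf.Module):
    def __init__(self):
        super().__init__()
        # One trainable (scale, bias) pair per loss channel, as in Eq. (10).
        self.theta = {ch: (tf.Variable(1.0, name=f"theta_a_{ch}"),
                           tf.Variable(0.0, name=f"theta_b_{ch}"))
                      for ch in ("click", "pay", "gmv")}

    def __call__(self, y_hat, labels, weights):
        """labels / weights: dicts keyed by channel ("click", "pay", "gmv")."""
        def shifted(ch):
            a, b = self.theta[ch]
            return y_hat * a ** 2 + b            # y_hat * theta_a^2 + theta_b

        eps = 1e-6                               # keep probabilities valid for the log
        return (bce(labels["click"], tf.clip_by_value(shifted("click"), eps, 1 - eps)) * weights["click"]
                + bce(labels["pay"], tf.clip_by_value(shifted("pay"), eps, 1 - eps)) * weights["pay"]
                + mse(labels["gmv"], shifted("gmv")) * weights["gmv"])  # gmv labels assumed BN-normalized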
Previous research has shown that adjusting the weights of each loss in an unconstrained summation could help solve the constrained multi-objective optimization problem (Chen et al. 2018). In this paper, we introduce a feedback controller to connect the unconstrained optimization with its constraints.
Controllable Constrained Optimization
A practical way to deal with multiple objectives is to combine all sub-model scores in a weighted manner:

$$\hat{y} = \mathrm{CTR}^{\alpha} \times \mathrm{CVR}^{\beta} \times \mathrm{Price}^{\gamma} \tag{11}$$
where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters that can be adjusted according to business requirements. For example, we can set $\gamma = 0$ and $\alpha$ to a very small value to increase the online CVR performance, while increasing $\gamma$ appropriately would lead to an increase in the average spending by customers. Though model-free and easy to interpret, such a technique relies heavily on manual tuning and lacks personalization for different customers.
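For illustration, the manual baseline in Eq. (11) amounts to a one-line function; the example parameter values below are only meant to mirror the discussion above, not tuned production settings:

# Manual score-combination baseline of Eq. (11).
def manual_rank_score(ctr, cvr, price, alpha=1.0, beta=1.0, gamma=0.0):
    return (ctr ** alpha) * (cvr ** beta) * (price ** gamma)

# e.g. gamma = 0 and a very small alpha emphasise CVR, as discussed above:
score = manual_rank_score(ctr=0.04, cvr=0.001, price=19.9, alpha=0.1, beta=1.0, gamma=0.0)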
Different from this manual dot-product approach, a PID feedback controller was introduced in this paper to tune the weights of the multi-objective loss (Åström and Hägglund 1995). Through the proportional ($K_p$), integral ($K_i$) and differential ($K_d$) components, the final control output $u(t)$ related to the error scale can be obtained, as illustrated in Figure 4.

Figure 4: A typical feedback control process. The model training process is simplified as a single component with the AUC(t) feedback signal.

A PID controller with a saturated proportional component $K_p$ and a saturated integral component $K_i$ has been incorporated into DC-LTR; the differential component $K_d$ (O'Dwyer 2009) was not included. The controller tuning process for the different channels can be formulated as:
$$e(t) = r(t) - auc(t) \tag{12}$$

$$\zeta(e) = \begin{cases} K_B, & e > K_B \\ e, & 0 < e \le K_B \\ 0, & e \le 0 \end{cases} \tag{13}$$

$$w \propto K_p\, \zeta(e(t)) + K_i\, \zeta\!\left(\int e(t)\, dt\right) \tag{14}$$

where $e(t)$ is the feedback error between the current state $auc(t)$ and the target state $r(t)$, and $\zeta$ is the saturation function that limits the error and the accumulated error.
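The following Python sketch shows one possible reading of Eqs. (12)-(14) as a saturated PI controller per channel; the class and method names are ours, and adding the basic control output u_0 follows the description of Table 2 later in the paper:

# Minimal sketch of the saturated PI controller used to tune a loss weight.
class SaturatedPIController:
    def __init__(self, target, k_p, k_i, k_b, u_0=0.0):
        self.target, self.k_p, self.k_i, self.k_b, self.u_0 = target, k_p, k_i, k_b, u_0
        self.integral = 0.0

    def _saturate(self, e):
        # Eq. (13): clamp negative errors to 0 and large errors to K_B.
        return min(max(e, 0.0), self.k_b)

    def update(self, auc_feedback):
        e = self.target - auc_feedback                 # Eq. (12)
        self.integral += e
        # Eq. (14): control output, later mapped linearly to a loss weight w.
        return self.u_0 + self.k_p * self._saturate(e) + self.k_i * self._saturate(self.integral)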
The complete tuning process of DC-LTR can be summarized as follows. Firstly, the outputs of the scoring network are collected and normalized. Secondly, the AUC and CoRR metrics and the feedback errors are calculated and used as inputs for the controllers to generate the outputs $u$. Thirdly, the controller outputs are mapped into the loss weights $w_{click}, w_{pay}, w_{gmv}$ through a linear transformation, and the optimizer sums all the losses to obtain the gradients of all trainable variables. Finally, the weights of the scoring network are updated and the new model weights are deployed online every few minutes.
More specifically, when the current feedback pay-AUC is smaller than the pay-AUC target $r_0$, the feedback error $e(t)$ will be positive for the PI controller of the pay channel. Through the tuning and amplification function of the controller, an increasingly larger tuning weight $w_{pay}$ will be produced to improve the pay-AUC performance. When the current feedback pay-AUC is larger than the target, a zero or negative error is produced for the PI controller. The control output $u_{pay}$ and the tuning weight $w_{pay}$ will then stabilize and stop increasing, while the maximization of clicks via $w_{click}$ will continue, as the click-AUC target $r_1$ is unreachable. This results in the business requirements being satisfied.
Application Development and Deployment
The online DC-LTR model was applied in the international e-commerce platforms of Alibaba Group, AliExpress (AE) and Lazada, whose buyers come from about 200 countries around the world. The deployment of DC-LTR mainly followed the procedure described in Figure 2. During the deployment, some specific techniques were developed to meet practical requirements: a stream sample buffer pool was developed to balance and sample training samples in real time, and an online-offline consistency verification technique was built to ensure the correctness of the online deployment.
Stream Sample Buffer Pool
Real-time logs such as exposures, clicks and transactions were used to generate samples. The delay of real-time logs would cause a large number of negative samples to be received and trained on before the corresponding positive samples suddenly arrived. A sample buffer pool in the real-time sample generation step was introduced to address this problem.
Different from typical offline sample pre-processing, the stream sample buffer pool requires the combination of an online bi-directional circular array and a Last In, First Out (LIFO) queue. Four basic operations of the pool (i.e. Pop, Time-up, Enqueue and Sample) are defined to perform high-level transformations such as sample buffering, sampling and balancing, as illustrated in Figure 5.

When a sample was generated, an enqueue task buffered it into a different sample buffer pool according to its label. When a sample became too old or the queue size exceeded a maximum limit, a dequeue task discarded the oldest sample. When a sample request came from the training phase, a dequeue task popped the newest negative samples and a repeated sampling task sampled from the existing positive samples in the pool. By adjusting the positive repeated-sampling rate and the negative queue size, the use of the newest negative samples and a balance between positive and negative samples were achieved. When an abnormal ratio of positive to negative samples was detected, the training process was put on hold until the sample pool was refilled with a proper proportion of samples.

Figure 5: The stream sample buffer pool.
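The sketch below illustrates the intended behavior of the buffer pool (newest negatives popped first, oldest dropped, positives repeatedly sampled, training held when the pool is unbalanced); it is a simplification with assumed parameter names, not the production implementation:

# Simplified stream sample buffer pool.
import random
from collections import deque

class StreamSampleBufferPool:
    def __init__(self, neg_queue_size=10000, pos_pool_size=2000, pos_resample_rate=4):
        self.negatives = deque(maxlen=neg_queue_size)   # oldest dropped automatically
        self.positives = deque(maxlen=pos_pool_size)
        self.pos_resample_rate = pos_resample_rate

    def enqueue(self, sample, label):
        (self.positives if label == 1 else self.negatives).append(sample)

    def next_batch(self, batch_size):
        # Hold training if the pool is too unbalanced or too empty.
        if not self.positives or len(self.negatives) < batch_size:
            return None
        negs = [self.negatives.pop() for _ in range(batch_size)]                # newest negatives
        poss = random.choices(list(self.positives), k=self.pos_resample_rate)   # repeated sampling
        return negs, poss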
Online and Offline Consistency
In order to guarantee the correctness of the deployment, online and offline consistency was verified mainly in two respects:

AUC consistency: to verify that the online data distribution is the same as the training distribution.

Scoring consistency: to verify that the online scoring values are the same as the training scoring values.
We managed to achieve an online-offline absolute AUC difference within 0.01. For example, the AUC difference between online and offline on 22nd August 2020 under the Just For You scenario of the AliExpress platform was 0.005 (offline AUC: 0.690, online AUC: 0.695). The percentage of wrongly sorted pairs due to scoring differences was kept lower than 3%. To measure it, a subset of online samples was collected and scored again by the offline networks after the real-time training procedure had been intentionally turned off for several hours; the difference between the online and offline scores was then quantified by counting the wrongly sorted pairs. For instance, the percentage of wrongly sorted pairs was 0.45% on 25th February 2020, as there were 18 wrongly sorted pairs among 4,030 pairs.
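The scoring-consistency check can be illustrated with a small helper that counts wrongly sorted pairs between the online and offline scores (an assumption of how the statistic is computed; e.g. 18 of 4,030 pairs gives roughly 0.45%):

# Count pairs whose relative order differs between online and offline scores.
from itertools import combinations

def wrong_sorting_pair_rate(online_scores, offline_scores):
    pairs = list(combinations(range(len(online_scores)), 2))
    wrong = sum(1 for i, j in pairs
                if (online_scores[i] - online_scores[j]) * (offline_scores[i] - offline_scores[j]) < 0)
    return wrong / len(pairs)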
Moreover, the real-time online AUC feedback is calculated for important feature inputs so as to monitor the online status. The real-time AUC values for selected features between 09th and 15th September 2020 are shown in Figure 6, where ltr denotes the online score output of the LTR model, which was kept around 0.690 during these days, and ctr and cvr are the CTR and CVR model outputs, respectively. It can be observed that the click-AUC of the cvr score was rather small, as cvr is the model output for pay events and is less relevant to click events, indicating an obvious conflict between the CTR and CVR models.

Figure 6: Online click-AUC for important features and the LTR model outputs.
Maintenance
Manual setting of the objective targets has been required since the DC-LTR AI Engine was deployed as a core service for the AliExpress and Lazada platforms, especially during promotional campaigns. Automatic alarms are designed to go off when the real-time logs have been stuck for too long, in which case a manual diagnosis is required to recover the real-time logs. Apart from this, no other major maintenance task for the DC-LTR AI Engine has been required since its deployment on AliExpress and Lazada in September 2019. We will continue to review the system one year after its deployment.
Application Use and Payoff
Improvement in Convergence
An evaluation of loss decoupling was made to show the influence of the loss bias technique. We applied the loss decoupling technique to a typical click-loss and pay-loss optimization problem with the 3-hidden-layer scoring networks. The sample batch size was set to 256 and the learning rate to 0.001, with categorical sparse embedding, Batch Normalization and Leaky ReLU activation engaged. Compared to the previously deployed LTR model without decoupling, the decoupled model (i.e. DC-LTR), trained with the same samples and hyper-parameter settings, achieved a reduction in click-loss from 0.22 to 0.16 and a reduction in pay-loss from 0.005 to 0.0005 (Table 1). This showed that the loss-bias decoupling was beneficial for both the click-loss and the pay-loss, ensuring that they were optimized around their respective distribution spaces.
                    Coupled    Decoupled    Change
Click-loss          0.22       0.16         ↓
Pay-loss            0.005      0.0005       ↓
Convergence Steps   500k       50k          ↓
Click-AUC           0.695      0.697        ↑
Pay-AUC             0.744      0.755        ↑

Table 1: Convergence comparison between the coupled baseline and the decoupled DC-LTR model.
As a result, the decoupled model converged at around 50,000 steps with the click-AUC stabilizing around 0.695. In contrast, the base model without decoupling required around 500,000 steps to converge.
Through decoupling, the influence between the different losses was reduced under DC-LTR. This also made separate tuning of each loss feasible.
Improvement in Adaptability
The DC-LTR model was incorporated into the AliExpress and Lazada platforms with the following losses: click-loss, cart-loss (add-to-cart loss), wish-loss (add-to-wish-list loss), pay-loss, price-loss (discounted-price square loss) and GMV-loss (weighted GMV square loss). The controller parameters were set as listed in Table 2.
Figure 7 shows a segment of the captured system process at 00:30 on 22nd August 2020. An unknown disturbance led to a decline in the online pay-AUC below the target value of 0.78. The drop was detected by the DC-LTR feedback controller, which raised $u_{pay}$ and $w_{pay}$ automatically. After two hours of continuous training, the online pay-AUC was increased back above the target level at 02:30, so the controller outputs started to decline slowly. The adjustment process finally ceased at 05:30. Specifically, we set the click-AUC target to 0.71, which was impossible to reach in practice. In this way, the controller would always attempt to optimize the click-loss while keeping the online pay-AUC above the target level.
Figure 8 shows a segment of the captured system process at 04:30 on 3rd September 2020. An unknown disturbance led to a decline in the online wish-AUC below its target value of 0.70. The drop was detected by the feedback controller of DC-LTR, which triggered an increase in $u_{wish}$ and $w_{wish}$. After 6 hours of continuous training, the online wish-AUC was raised back above its target level at 10:30.
             u_0    r(t)    K_p    K_i    K_B
Click-loss   1.0    0.710   1.0    0.0    2.0
Wish-loss    0.5    0.70    1.0    0.01   5.0
Cart-loss    0.5    0.70    1.0    0.01   5.0
Pay-loss     50.0   0.78    1.0    0.2    50.0
Price-loss   0.0    0.0     1.0    0.0    1e-3
GMV-loss     0.0    0.0     1.0    0.0    1e-3

Table 2: Parameters for the feedback controllers in each channel. r(t) is the controller target. K_p and K_i are the parameters of the proportional and integral components. K_B is the saturation bound for the integral component. u_0 is the basic control output of the controller.
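For illustration, the Table 2 settings map directly onto one controller per loss channel, reusing the SaturatedPIController sketch introduced earlier (an illustrative reading, not the production configuration code):

# One saturated PI controller per loss channel, parameterized as in Table 2.
controllers = {
    "click": SaturatedPIController(target=0.710, k_p=1.0, k_i=0.0,  k_b=2.0,  u_0=1.0),
    "wish":  SaturatedPIController(target=0.70,  k_p=1.0, k_i=0.01, k_b=5.0,  u_0=0.5),
    "cart":  SaturatedPIController(target=0.70,  k_p=1.0, k_i=0.01, k_b=5.0,  u_0=0.5),
    "pay":   SaturatedPIController(target=0.78,  k_p=1.0, k_i=0.2,  k_b=50.0, u_0=50.0),
    "price": SaturatedPIController(target=0.0,   k_p=1.0, k_i=0.0,  k_b=1e-3, u_0=0.0),
    "gmv":   SaturatedPIController(target=0.0,   k_p=1.0, k_i=0.0,  k_b=1e-3, u_0=0.0),
}
# Each training interval: u = controllers["pay"].update(current_pay_auc), then
# u is mapped linearly to the loss weight w_pay used in Eq. (10).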
Figure 7: DC-LTR adjusts the model automatically when the pay-AUC drops below the target level. Note that the click-AUC target of 0.71 was intentionally set to be unreachable so that DC-LTR continuously maximizes clicks.
Figure 8: DC-LTR adjusts control outputs automatically to
keep the wish-AUC from dropping.
Then the controller output declined, and the adjustment process eventually ceased at 15:30. With DC-LTR, the online wish-AUC was kept above the target level while the click-loss was being minimized throughout the adjustment process.
Deployment Performance
The online performance achieved by DC-LTR in AliExpress and Lazada is reported in Figure 9. The first experiment, under the Just For You scenario of AliExpress (AE JFY) during March, was an A/B test studying the adaptability of DC-LTR with the objectives set to maximizing clicks while keeping the GMV high. Prior to the activation of DC-LTR, analysis of this scenario showed that the ranking score was negatively correlated with item prices; the CoRR between the rank score and the item price was around -0.15.
In other words, items with lower prices were more likely to be ranked in the top positions, resulting in a very low average spending by customers. With the help of the square loss in DC-LTR, this correlation was improved to nearly zero (a magnitude of about 0.01). As a result, the number of clicks and the GMV were increased by 3.11% and 2.70%, respectively.
Another A/B test was conducted in the Detail Page scenario of AliExpress (AE Detail) during March, with the objective of maximizing GMV while keeping the average spending by customers high. Before the deployment of DC-LTR, the strategy for generating the final rank score was a simple manual dot-product combination of CTR and CVR. Compared to this manual combination strategy, DC-LTR improved the number of clicks and the GMV by 6.00% and 5.35%, respectively. Moreover, the conflict between CVR and CTR was resolved in this case: CTR, CVR and the average spending by customers were simultaneously improved by DC-LTR. The online training and real-time deployment framework powered by DC-LTR has become a core service for both AliExpress and Lazada. It is the only process that responds to changing business requirements directly and changes the optimization objectives in real time.
Prior to the Singles' Day Shopping Carnival of 2019 (i.e. 1st November to 10th November 2019), the business objective was adjusted to maximizing the add-to-cart rate: customers add their favorite items to their carts and wait for the discounts during the 24-hour period of 11th November 2019 to commence their purchases. In the early morning of 11th November 2019, the business objectives were switched to maximizing transaction volume in anticipation of the start of the shopping carnival. Online training and deployment faced enormous challenges as visitor traffic rose to 3 to 10 times that of normal days; thus, down-sampling was performed to maintain a steady real-time training process. After a comparison of the online performance between 2019 and 2018 was made at noon, average customer spending became the maximization objective in AliExpress, while the transaction maximization objective still held in Lazada. The adjustments to training were made automatically by DC-LTR in real time after the objectives were updated, and the new model took effect around 5 minutes later. After the campaign, the optimization objectives were immediately switched by DC-LTR to maximizing clicks, with the constraint that GMV should be kept at the level of normal days. Under peak load during the 2019 Singles' Day Shopping Carnival, DC-LTR helped increase the GMV of both AliExpress and Lazada by 11%.
Figure 9: Online performance for the DC-LTR system under
different scenarios and phases in AliExpress and Lazada.
Lessons Learned During Deployment
It is worth mentioning that, even with DC-LTR automatically adapting among different business objectives and constraints, there have still been some situations that the AI Engine could not handle. Several important lessons have been learned from our deployment experience.

Firstly, proper targets should be set to avoid unintended final states. The optimal final state cannot be guaranteed when two or more targets are unreachable. One possible solution is to ensure that there is only one unreachable target, so that this target is maximized under the constraints of the other, reachable objectives.
Secondly, tuning each channel separately using only that channel's feedback error is a limitation: DC-LTR currently does not make the best use of the feedback errors from the other channels. To improve in this respect, more powerful controllers, such as Generalized Predictive Control (Clarke, Mohtadi, and Tuffs 1987) and Dynamic Matrix Control (Haeri and Beik 2003), will be explored.
Conclusions and Future Work
In this paper, we reported on our experience deploying a general online training and real-time deployment framework for ranking models in AliExpress and Lazada to automatically adapt among rapidly changing business objectives. An algorithmic framework, the online Deep Controllable Learning-To-Rank (DC-LTR), was proposed to automatically optimize constrained business objectives and reduce the conflicts between different objectives. To the best of our knowledge, this is the only deployed attempt that has managed to control the final state of optimization to meet business requirements.

In subsequent research, we will focus on enhancing DC-LTR in the following aspects:
1. Autonomous goal adjustment: the current DC-LTR still requires system administrators to manually set the goals. In the future, we will explore bandit methods to enable DC-LTR to learn to automate goal adjustment.
2. Delayed data distribution: online training often lags behind the actual data distribution. To address this problem, we will explore how to leverage prior knowledge to adjust the data distribution.
Acknowledgments
This research is supported, in part, by Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba-NTU Singapore Joint Research Institute (JRI) (Alibaba-NTU-AIR2019B1), Nanyang Technological University, Singapore; the National Research Foundation, Singapore, under its AI Singapore Programme (AISG Award No: AISG-GC-2019-003); the NRF Investigatorship Programme (NRFI Award No: NRF-NRFI05-2019-0002); the Nanyang Assistant Professorship (NAP); and the RIE 2020 Advanced Manufacturing and Engineering (AME) Programmatic Fund (No. A20G8b0102), Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
References
Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, 265-283.
Åström, K. J.; and Hägglund, T. 1995. PID Controllers: Theory, Design, and Tuning, volume 2. Instrument Society of America, Research Triangle Park, NC.
Benesty, J.; Chen, J.; Huang, Y.; and Cohen, I. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing, 1-4. Springer.
Chen, Z.; Badrinarayanan, V.; Lee, C.-Y.; and Rabinovich, A. 2018. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 794-803.
Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. 2016. Wide & deep learning for recommender systems. In RecSys, 7-10.
Clarke, D. W.; Mohtadi, C.; and Tuffs, P. 1987. Generalized predictive control - Part I. The basic algorithm. Automatica 23(2): 137-148.
Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8): 861-874.
Haeri, M.; and Beik, H. Z. 2003. Extension of nonlinear DMC for MIMO systems. In ICCA, 375-379.
Hu, Y.; Da, Q.; Zeng, A.; Yu, Y.; and Xu, Y. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In KDD, 368-377.
Joachims, T.; Li, H.; Liu, T.-Y.; and Zhai, C. 2007. Learning to rank for information retrieval (LR4IR 2007). In SIGIR, 58-62.
Kunaver, M.; and Pozrl, T. 2017. Diversity in recommender systems - A survey. Knowledge-Based Systems 123: 154-162.
Lin, X.; Chen, H.; Pei, C.; Sun, F.; Xiao, X.; Sun, H.; Zhang, Y.; Ou, W.; and Jiang, P. 2019. A Pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In RecSys, 20-28.
Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; and Gai, K. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In SIGIR, 1137-1140.
Milojkovic, N.; Antognini, D.; Bergamin, G.; Faltings, B.; and Musat, C. 2019. Multi-gradient descent for multi-objective recommender systems. arXiv preprint arXiv:2001.00846.
Ning, X.; and Karypis, G. 2010. Multi-task learning for recommender system. In ACML, 269-284.
O'Dwyer, A. 2009. Handbook of PI and PID Controller Tuning Rules. Imperial College Press.
Pan, L.; Meng, X.; Shen, Z.; and Yu, H. 2009. A reputation pattern for service oriented computing. In ICICS, 1-5.
Ribeiro, M. T.; Ziviani, N.; Moura, E. S. D.; Hata, I.; Lacerda, A. M.; and Veloso, A. A. 2014. Multi-objective Pareto-efficient algorithms for recommender systems. ACM Transactions on Intelligent Systems and Technology 5(4).
Shen, Z.; Yu, H.; Miao, C.; and Weng, J. 2011. Trust-based web service selection in virtual communities. Web Intelligence and Agent Systems 9(3): 227-238.
Wen, H.; Zhang, J.; Wang, Y.; Lv, F.; Bao, W.; Lin, Q.; and Yang, K. 2020. Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction. In SIGIR, 2377-2386.
Yu, H.; Cai, Y.; Shen, Z.; Tao, X.; and Miao, C. 2010. Agents as intelligent user interfaces for the net generation. In IUI, 429-430.
Zeng, A.; Yu, H.; Da, Q.; Zhan, Y.; and Miao, C. 2020. Accelerating ranking in e-commerce search engines through contextual factor selection. In IAAI, 13212-13219.
Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; and Gai, K. 2018. Deep interest network for click-through rate prediction. In KDD, 1059-1068.