Enhancing E-commerce Recommender System Adaptability with Online Deep Controllable Learning-To-Rank

Anxiang Zeng 1,2*, Han Yu 1, Hualin He 3*, Yabo Ni 3, Yongliang Li 3, Jingren Zhou 3 and Chunyan Miao 1,2*
1 School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore
2 Alibaba-NTU Singapore Joint Research Institute
3 Alibaba Group, Hangzhou, China
Abstract
In the past decade, recommender systems for e-commerce have witnessed significant advancement. Recently, the focus of research has shifted from single-objective optimization to multi-objective optimization in the face of changing business requirements. For instance, the add-to-cart rate is the optimization target prior to a promotional campaign, while conversion rates should be kept from declining. During the campaign, the target changes to maximizing transactions only. Immediately after the campaign, maximizing click-through rates is required while transactions should be kept at the daily level. Dynamically adapting among these short-term and rapidly changing objectives is an important but difficult problem, because the optimization objectives potentially conflict with each other. In this paper, we report our experience designing and deploying the online Deep Controllable Learning-To-Rank (DC-LTR) recommender system to address this challenge. It enhances the feedback controller in LTR with multi-objective optimization so as to maximize different objectives under constraints. Its ability to dynamically adapt to changing business objectives has resulted in significant business advantages. Since September 2019, DC-LTR has become a core service enabling adaptive online training and real-time deployment of ranking models for changing business objectives in AliExpress and Lazada. Under both everyday use scenarios and peak load scenarios during large promotional campaigns, DC-LTR has achieved significant improvements in adaptively satisfying real-world business objectives.
Introduction
As e-commerce platforms grow larger in scale, artificial intelligence (AI) techniques (e.g., agent and reputation modelling (Pan et al. 2009; Yu et al. 2010; Shen et al. 2011)) are increasingly being applied, and personalized recommendation is playing an important role. On the one hand, improving recommendation accuracy and personalization has been an active area of research, in which the wide and deep model proposed by Google (Cheng et al. 2016), the Deep Interest Network (Zhou et al. 2018) and the entire space multi-task models (ESSM and ESSM2) (Ma et al. 2018) proposed by Alibaba have been widely adopted.
On the other hand, recommender systems face different business objectives in different scenarios and stages of
recommendation (Kunaver and Pozrl 2017). Firstly, recommendation scenarios can be divided into different types (e.g., pre-, during- and post-purchase, campaign, promotion, bundle) with different objectives for different user groups or different businesses. Secondly, product recommendation during online promotional campaigns with high traffic volumes often faces frequently changing business objectives. Moreover, the log data and feature distributions of the recommendation system are very different from those generated under normal usage. Adapting to such changes quickly is key to the success of promotional campaigns. Existing heavy models, such as the click-through rate (CTR) and conversion rate (CVR) prediction models (e.g., (Cheng et al. 2016)) used in the rough ranking and full ranking phases, with single-objective optimization and daily build-and-deploy cycles, could not address this business challenge.
Attempts have been made to dynamically adapt among these potentially conflicting optimization objectives. The entire space multi-task models, ESSM and ESSM2, are optimized for both the click-loss and the pay-loss over the entire space by means of multi-task learning (Ma et al. 2018; Wen et al. 2020). However, the sub-models of ESSM were optimized separately with their own losses, meaning that the optimization process was still a collection of single-objective optimizations. A gradient normalization algorithm (GradNorm) for adaptive loss balancing was proposed in 2018 to model the uncertainty in deep multi-task networks (Chen et al. 2018). Though it improved the optimization process for unconstrained multi-objective problems, it lacked the ability to handle multi-objective problems with constraints. Reinforcement learning has also been tried to trade off the CTR and CVR targets with online feedback as the reward (Hu et al. 2018), but extensive online exploration may cause too much uncertainty and consume too many online resources. Recently, in 2019, a multi-gradient descent algorithm for multi-objective recommender systems (MGDRec) was proposed to achieve Pareto optimization in recommendations (Milojkovic et al. 2019; Ning and Karypis 2010). Another multi-objective, model-free Pareto-efficient framework for learning to rank (PE-LTR) (Lin et al. 2019; Ribeiro et al. 2014) has also achieved remarkable online performance. Unfortunately, these Pareto-based algorithms require strict Pareto or Pareto-efficient conditions during optimization, making the optimized final state uncontrollable. Thus, they were not suitable for applications
which need to meet specific business requirements.
In this paper, we report our experience designing and deploying the online Deep Controllable Learning-To-Rank (DC-LTR) recommender system to address this challenge. It enhances the feedback controller in LTR with multi-objective optimization so as to maximize different objectives under constraints. Its ability to dynamically adapt to changing business objectives has resulted in significant business advantages. Since its deployment in AliExpress (https://www.aliexpress.com/) and Lazada (https://www.lazada.com/) in September 2019, DC-LTR has become a core service, enabling adaptive online training and real-time deployment of ranking models based on changing business objectives. Under both everyday use scenarios and peak load scenarios during large promotional campaigns, DC-LTR has achieved significant improvements in adaptively satisfying real-world business objectives.
Application Description
The general framework of the recommendation system adopted by Alibaba is presented in Figure 1. When a user initiated a request, the customer behavior history data was used to select hundreds of thousands of related items from billions of candidates via multiple channels in the matching/retrieval phase. Then, a vector-based rough ranking model filtered out the top items in each channel, returning thousands of candidates. After that, the CTR and CVR prediction models ranked items separately in the full ranking phase. No items were filtered out during this stage because neither CTR nor CVR could be the only ranking metric. Finally, a small non-linear Learning-to-Rank (LTR) model (Joachims et al. 2007) took the ranking outputs of the CTR and CVR models and the online user and item features as inputs to determine the final dozens of items to be returned to the user. After slight adjustments by scattering and re-ranking, the resulting item list was returned to the user's applications (e.g., mobile apps, web browsers, etc.). The LTR phase played the most direct role in meeting business requirements in the entire recommendation process (Zeng et al. 2020). As high-click items may not always result in high transactions, trade-offs between CTR and CVR were often required.

Figure 1: A general framework for a recommender system showing the matching phase and the ranking phase. Note that items were filtered by the LTR model, instead of by the CTR and CVR models in the full ranking phase.

When applied to the recommendation scenarios in Alibaba, the LTR approach goes through the following steps, as illustrated in Figure 2.
1. Real-time Log Analysis: Analysis, de-duplication, and monitoring of original user logs such as exposures, clicks, add-to-cart events, collection (wish-list) events, and transactions were included in this phase.
2. Real-time Sample Generation: Correlation between features and label events such as clicks and transactions was established. Log delay and misalignment among different logs were fixed by stream join techniques such as the retraction of long-lagging pay events. Note that all click and purchase events happen after the corresponding item exposure event. Generally, most clicks happen several seconds after the exposures, while more than 40% of purchase events happen 24 hours after the clicks and the item exposures. Thus, the retraction technique, which retracts a negative sample and reissues a corresponding positive sample, was applied to obtain the pay labels (a minimal sketch is given after this list). In other words, during online training, for any positive sample, there will be a corresponding negative sample preceding it. Finally, different from offline pre-processing of samples, an online stream buffer pool technique was applied to achieve dynamic balancing and repeated sampling of the samples.
3. Real-time Stream Model Training: Network structure design, model training and verification, monitoring of the area under the curve (AUC) and other model indicators, and online-offline consistency verification were performed during the model training phase. Online AUC values for important features were calculated so as to monitor distribution changes in real time. Online AUC feedback for the model outputs was also collected to indicate the current status of the scoring model. The scoring networks for online LTR were designed to be small with fast convergence, making real-time stream training and real-time deployment feasible.
4. Deployment: Model network structures were deployed by daily building, while the model weights of each layer were deployed online every 5 minutes, immediately after they were updated. Previous model scores, such as the CTR and CVR model scores, the rough ranking score and the match score from the matching phase, were collected as inputs for the LTR model, and the final score for each candidate item was obtained together with other basic item and user features. Top-ranking items were then selected to be presented in customer applications.
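To make the retraction technique in step 2 concrete, the following minimal Python sketch (our own illustration; the event fields and helper names are hypothetical and not the production stream-join code) emits a negative sample for every exposure and, when a delayed pay event arrives, retracts that negative sample and reissues a positive one:

# Hypothetical sketch of the retraction technique from step 2: every exposure
# first yields a negative sample; when a (possibly hours-late) pay event joins
# the stream, the earlier negative sample is retracted and a positive sample
# with the same features is reissued.
from collections import namedtuple

Sample = namedtuple("Sample", ["features", "label", "weight"])

def join_with_retraction(event):
    """Map a joined log event to a list of training samples.

    `event` is assumed to carry `features`, `type` ("exposure" or "pay")
    and, for pay events, the features of the original exposure.
    """
    if event["type"] == "exposure":
        # Emit the pessimistic negative sample immediately.
        return [Sample(event["features"], label=0, weight=+1.0)]
    if event["type"] == "pay":
        # Retract the earlier negative (weight -1) and reissue it as positive.
        return [Sample(event["features"], label=0, weight=-1.0),
                Sample(event["features"], label=1, weight=+1.0)]
    return []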
Apart from the real-time stream process, the logs and samples from the real-time stream were also collected by the offline data processing system, where more complicated and complete analyses were performed.
The DC-LTR algorithm in our AI Engine covers the above four processes, of which the second step (sample generation) and the third step (model training) are the most important design points.
Figure 2: LTR training and online deployment flowchart
Use of AI Technology
In this section, we describe the proposed Deep Controllable Learning-To-Rank (DC-LTR) approach in our AI engine. The proposed approach is composed of a scoring network, a bias network and a tuning network based on feedback control, as illustrated in Figure 3.

The scoring network for online DC-LTR is designed to be lightweight with fast convergence so that it can learn the changing distribution and be deployed online in real time. A network with no more than 100 feature inputs, no more than 1 MB of embedding weights for sparse categorical features, and 3 hidden fully connected layers of sizes [128, 32, 16] has been adopted. Batch Normalization is performed in the input layer and every hidden layer before the Leaky ReLU activation function to speed up model convergence.
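A minimal Keras-style sketch of such a scoring network is given below. It is our own illustration of the stated design (three hidden layers of sizes [128, 32, 16], Batch Normalization before each Leaky ReLU); the feature dimensions, vocabulary size and embedding size are placeholder assumptions rather than the production values:

# Sketch of the light-weight scoring network described above.
import tensorflow as tf

def build_scoring_network(num_dense_features=80, cat_vocab_size=10000, emb_dim=8):
    dense_in = tf.keras.Input(shape=(num_dense_features,), name="dense_features")
    cat_in = tf.keras.Input(shape=(1,), dtype="int32", name="categorical_feature")

    # Small sparse-categorical embedding (kept well under ~1 MB of weights).
    emb = tf.keras.layers.Embedding(cat_vocab_size, emb_dim)(cat_in)
    emb = tf.keras.layers.Flatten()(emb)

    x = tf.keras.layers.Concatenate()([dense_in, emb])
    x = tf.keras.layers.BatchNormalization()(x)          # BN in the input layer
    x = tf.keras.layers.LeakyReLU()(x)
    for units in [128, 32, 16]:                          # three hidden layers
        x = tf.keras.layers.Dense(units)(x)
        x = tf.keras.layers.BatchNormalization()(x)      # BN before the activation
        x = tf.keras.layers.LeakyReLU()(x)
    y_hat = tf.keras.layers.Dense(1, activation="sigmoid", name="score")(x)
    return tf.keras.Model([dense_in, cat_in], y_hat)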
The bias network includes a user bias and a loss bias. The user bias network, with the same structure as the scoring network, is applied during the training phase but removed when scoring online. The bias among different users is learned during training, and the item ranking order for each user is still retained after the user bias network is removed. Moreover, the loss bias network was designed to decouple, i.e. reduce, the influence between the different multi-objective losses. It is likewise applied during the training phase and removed when deployed; similarly, no ranking order is contaminated by the loss bias. With such adjustments, a significant improvement in convergence speed can be achieved, making real-time stream training and real-time deployment feasible.
The tuning network is a feedback controller implemented using TensorFlow (Abadi et al. 2016). Feedback from the scoring network is collected and the constrained business objectives are set, so that the constrained multi-objective optimization process can be tuned automatically by the controller.
Constrained Multi-Objective Function
A constrained multi-objective function was incorporated into DC-LTR to describe different business requirements using the area under the curve (AUC) (Fawcett 2006) and the Pearson correlation (CoRR) (Benesty et al. 2009) metrics.
In recommender systems, AUC is a widely used metric to measure model performance during training. In most cases, the relationship between AUC, CTR and the Sigmoid cross entropy loss could be expressed by

$$\text{entropy\_loss} \downarrow \;\sim\; \mathrm{AUC} \uparrow \;\sim\; \mathrm{CTR} \uparrow \tag{1}$$
By reducing the Sigmoid cross entropy loss, we could increase the AUC and thus the online performance of the recommender system. The entropy loss (denoted by $\ell_1$) could be expressed as:

$$\ell_1 = -\sum_x \left[ y^{l}(x) \log \hat{y}(x) + \left(1 - y^{l}(x)\right) \log\left(1 - \hat{y}(x)\right) \right] \tag{2}$$

where $y^{l} \in \{0, 1\}$, $l \in \{\text{click}, \text{wish}, \text{cart}, \text{pay}\}$ are the labels for the different events (click, wish, cart and pay), and $\hat{y} \in (0, 1)$ is the output of the scoring network.
The AUC metric could be obtained by (Zhou et al. 2018)

$$\mathrm{AUC} = \sum_{y^{l}(x)=1} \frac{\mathrm{rank}(\hat{y})}{N M} - \frac{1 + M}{2N} \tag{3}$$

where $M$ is the number of positive samples and $N$ is the number of negative samples. $\mathrm{rank}(\hat{y}) \in [1, M + N]$ denotes the rank of a sample $x$ when ordered by the model score output $\hat{y}$.
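The rank-based formula (3) can be checked with a few lines of Python; this is an illustrative sketch only (using scipy.stats.rankdata to handle tied scores):

# Small NumPy check of the rank-based AUC formula in Eq. (3).
import numpy as np
from scipy.stats import rankdata

def rank_auc(y_true, y_score):
    """AUC = sum of positive-sample ranks / (N*M) - (1 + M) / (2*N)."""
    ranks = rankdata(y_score)          # ranks in [1, M + N]
    m = np.sum(y_true == 1)            # number of positive samples
    n = np.sum(y_true == 0)            # number of negative samples
    return ranks[y_true == 1].sum() / (n * m) - (1 + m) / (2 * n)

y = np.array([0, 0, 1, 0, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
print(rank_auc(y, s))                  # equals the pairwise (Mann-Whitney) AUC, here 4/6 ≈ 0.667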
Figure 3: The system architecture of the AI Engine (the DC-LTR approach). The yellow components are the trainable parts of the whole framework. The gray and white components are un-trainable but tunable. The yellow arrows indicate the forward process and the black arrows represent the backward process.
Suppose that $\hat{y}$ and $y^{gmv}$ are normalized. The relationship between CoRR and the square loss could be expressed as:

$$\text{square\_loss} \downarrow \;\sim\; \mathrm{CoRR} \uparrow \tag{4}$$

The square loss (denoted by $\ell_2$) is:

$$\ell_2 = \sum_x \left( BN(y^{l}(x)) - BN(\hat{y}(x)) \right)^2 \tag{5}$$

where $BN$ denotes the normalization operation.
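Relation (4) can be made explicit with a short derivation of our own (not given in the paper): for BN-normalized (zero-mean, unit-variance) $\hat{y}$ and $y$ over $N$ samples,

$$\sum_x \big(BN(y(x)) - BN(\hat{y}(x))\big)^2 = \sum_x BN(y)^2 + \sum_x BN(\hat{y})^2 - 2\sum_x BN(y)\,BN(\hat{y}) = 2N\big(1 - \mathrm{CoRR}(\hat{y}, y)\big),$$

so minimizing the square loss $\ell_2$ is equivalent to maximizing CoRR.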
For daily-use recommendation scenarios, the aim was to maximize CTR while maintaining a high CVR. The optimization problem could be described as:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{click}), \quad \text{s.t. } c_1: \mathrm{AUC}(\hat{y}, y^{pay}) > r_0 \tag{6}$$

where $r_0$ is the reference base pay-AUC. The online CVR could be kept high by ensuring $\mathrm{AUC}(\hat{y}, y^{pay}) > r_0$.
In contrast, during promotional campaigns such as the Singles' Day Festival (on November 11th), the business objective becomes maximizing CVR while maintaining a high CTR:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{pay}), \quad \text{s.t. } c_1: \mathrm{AUC}(\hat{y}, y^{click}) > r_1 \tag{7}$$
To avoid a reduction in transactions, i.e. in Gross Merchandise Volume (GMV), the above objective function must be achieved while keeping the correlation between the sample GMV and the model outputs positive. Thus, it becomes:

$$\max_{\hat{y}=f(x)} \mathrm{AUC}(\hat{y}, y^{click}), \quad \text{s.t. } \; c_1: \mathrm{AUC}(\hat{y}, y^{pay}) > r_0, \;\; c_2: \mathrm{CoRR}(\hat{y}, y^{gmv}) > r_2 \tag{8}$$

Specifically, $y^{gmv}$ is the normalized GMV of each sample.
Unconstrained Multi-Loss Decoupling
To solve the optimization problem with constraints, we first
consider an unconstrained case. By optimizing the AUC
objective with Sigmoid cross entropy loss, and optimizing
the CoRR objective with square loss, a multi-objective loss
function could then be formulated as a weighted sum of all
partial loss functions:
$$\begin{aligned} \text{Loss} ={} & \text{entropy\_loss}(\hat{y}, y^{click}) \times w_{click} \;+ \\ & \text{entropy\_loss}(\hat{y}, y^{pay}) \times w_{pay} \;+ \\ & \text{square\_loss}(\hat{y}, y^{gmv}) \times w_{gmv} \;+ \;\cdots \end{aligned} \tag{9}$$

where entropy_loss and square_loss are as described before. Determining the weights $w_{click}, w_{pay}, w_{gmv}$ of each loss function and decoupling the influence between the loss functions were important problems.
Generally speaking, optimization of the click-loss would eventually make the model output converge around the average CTR ($\approx 0.04$, Sigmoid activated), and optimization of
the pay-loss would eventually make the model output converge around the average CVR ($\approx 0.0001$, Sigmoid activated). The summation of the click-loss and the pay-loss would confuse the model outputs, resulting in loss coupling. In our approach, a trainable loss bias was introduced to decouple the influence between the different loss functions:
$$\begin{aligned} \text{Loss} ={} & \text{entropy\_loss}(\hat{y} \times \theta_{a_{click}}^{2} + \theta_{b_{click}},\, y^{click}) \times w_{click} \;+ \\ & \text{entropy\_loss}(\hat{y} \times \theta_{a_{pay}}^{2} + \theta_{b_{pay}},\, y^{pay}) \times w_{pay} \;+ \\ & \text{square\_loss}(\hat{y} \times \theta_{a_{gmv}}^{2} + \theta_{b_{gmv}},\, y^{gmv}) \times w_{gmv} \;+ \;\cdots \end{aligned} \tag{10}$$

where $\theta_{a_{click}}, \theta_{b_{click}}, \theta_{a_{pay}}, \theta_{b_{pay}}, \theta_{a_{gmv}}$ and $\theta_{b_{gmv}}$ are trainable weights, one pair for each loss, and $w_{click}, w_{pay}, w_{gmv}$ are un-trainable parameters to be tuned. By introducing the trainable loss bias, the losses are optimized around their respective mean spaces and the coupling is reduced.
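A hedged TensorFlow sketch of the decoupled loss in Eq. (10) is shown below. The variable names and the clipping safeguard are our own; the GMV labels are assumed to be already normalized as in Eq. (5), and the weights w_* are the non-trainable values later set by the feedback controller:

# Illustrative implementation of the decoupled multi-objective loss, Eq. (10).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
mse = tf.keras.losses.MeanSquaredError()

class DecoupledLoss(tf.Module):
    def __init__(self):
        super().__init__()
        # One trainable (scale, bias) pair per loss channel, as in Eq. (10).
        self.theta = {ch: (tf.Variable(1.0, name=f"theta_a_{ch}"),
                           tf.Variable(0.0, name=f"theta_b_{ch}"))
                      for ch in ("click", "pay", "gmv")}

    def __call__(self, y_hat, labels, weights):
        """labels / weights: dicts keyed by channel ("click", "pay", "gmv")."""
        def shifted(ch):
            a, b = self.theta[ch]
            return y_hat * a ** 2 + b            # y_hat * theta_a^2 + theta_b

        eps = 1e-6                               # keep probabilities valid for the log
        return (bce(labels["click"], tf.clip_by_value(shifted("click"), eps, 1 - eps)) * weights["click"]
                + bce(labels["pay"], tf.clip_by_value(shifted("pay"), eps, 1 - eps)) * weights["pay"]
                + mse(labels["gmv"], shifted("gmv")) * weights["gmv"])  # gmv labels assumed BN-normalized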
Previous research has shown that adjusting the weights of each loss in an unconstrained summation could help solve the constrained multi-objective optimization problem (Chen et al. 2018). In this paper, we introduce a feedback controller to connect the unconstrained optimization with its constraints.
Controllable Constrained Optimization
A practical way to deal with multiple objectives is to combine all sub-model scores in a weighted manner:

$$\hat{y} = \mathrm{CTR}^{\alpha} \times \mathrm{CVR}^{\beta} \times \mathrm{Price}^{\gamma} \tag{11}$$
where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters that can be adjusted according to business requirements. For example, we can set $\gamma = 0$ and $\alpha$ to a very small value to increase the online CVR performance, while increasing $\gamma$ appropriately would lead to an increase in the average spending by customers. Though model-free and easy to interpret, such a technique relies heavily on manual tuning and lacks personalization for different customers.
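For illustration, the manual baseline in Eq. (11) amounts to a one-line function; the example parameter values below are only meant to mirror the discussion above, not tuned production settings:

# Manual score-combination baseline of Eq. (11).
def manual_rank_score(ctr, cvr, price, alpha=1.0, beta=1.0, gamma=0.0):
    return (ctr ** alpha) * (cvr ** beta) * (price ** gamma)

# e.g. gamma = 0 and a very small alpha emphasise CVR, as discussed above:
score = manual_rank_score(ctr=0.04, cvr=0.001, price=19.9, alpha=0.1, beta=1.0, gamma=0.0)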
Different from this manual dot-product approach, a PID feedback controller was introduced in this paper to tune the weights of the multi-objective loss (Åström and Hägglund 1995). Through the proportional ($K_p$), integral ($K_i$) and differential ($K_d$) components, the final control output $u(t)$ related to the error scale can be obtained, as illustrated in Figure 4.

Figure 4: A typical feedback control process. The model training process is simplified as a single component with the AUC(t) feedback signal.

A PID controller with a saturated proportional component $K_p$ and a saturated integral component $K_i$ has been incorporated into DC-LTR; the differential component $K_d$ (O'Dwyer 2009) was not included. The controller tuning process for the different channels can be formulated as:
$$e(t) = r(t) - auc(t) \tag{12}$$

$$\zeta(e) = \begin{cases} K_B, & e > K_B \\ e, & 0 < e \le K_B \\ 0, & e \le 0 \end{cases} \tag{13}$$

$$w \propto K_p\, \zeta(e(t)) + K_i\, \zeta\!\left(\int e(t)\, dt\right) \tag{14}$$

where $e(t)$ is the feedback error between the current state $auc(t)$ and the target state $r(t)$, and $\zeta$ is the saturation function that limits the error and the accumulated error.
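The following Python sketch shows one possible reading of Eqs. (12)-(14) as a saturated PI controller per channel; the class and method names are ours, and adding the basic control output u_0 follows the description of Table 2 later in the paper:

# Minimal sketch of the saturated PI controller used to tune a loss weight.
class SaturatedPIController:
    def __init__(self, target, k_p, k_i, k_b, u_0=0.0):
        self.target, self.k_p, self.k_i, self.k_b, self.u_0 = target, k_p, k_i, k_b, u_0
        self.integral = 0.0

    def _saturate(self, e):
        # Eq. (13): clamp negative errors to 0 and large errors to K_B.
        return min(max(e, 0.0), self.k_b)

    def update(self, auc_feedback):
        e = self.target - auc_feedback                 # Eq. (12)
        self.integral += e
        # Eq. (14): control output, later mapped linearly to a loss weight w.
        return self.u_0 + self.k_p * self._saturate(e) + self.k_i * self._saturate(self.integral)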
The complete tuning process of DC-LTR can be summarized as follows. Firstly, the outputs of the scoring network are collected and normalized. Secondly, the AUC and CoRR metrics and the feedback errors are calculated and used as inputs for the controllers to generate the outputs $u$. Thirdly, the controller outputs are mapped into the loss weights $w_{click}, w_{pay}, w_{gmv}$ through a linear transformation, and the optimizer sums all the losses to obtain the gradients of all trainable variables. Finally, the weights of the scoring network are updated and the new model weights are deployed online every few minutes.
More specifically, when the current feedback pay-AUC is smaller than the pay-AUC target $r_0$, the feedback error $e(t)$ will be positive for the PI controller of the pay channel. Through the tuning and amplification function of the controller, an increasingly larger tuning weight $w_{pay}$ will be produced to improve the pay-AUC performance. When the current feedback pay-AUC is larger than the target, a zero or negative error is produced for the PI controller. The control output $u_{pay}$ and the tuning weight $w_{pay}$ will then stabilize and stop increasing, while the maximization of clicks via $w_{click}$ will continue, as the click-AUC target $r_1$ is unreachable. This results in the business requirements being satisfied.
Application Development and Deployment
The online DC-LTR model was applied in the international e-commerce platforms of Alibaba Group, AliExpress (AE) and Lazada, whose buyers come from about 200 countries around the world. The deployment of DC-LTR mainly followed the procedure described in Figure 2. During the deployment, some specific techniques were developed to meet practical requirements: a stream sample buffer pool was developed to balance and sample training samples in real time, and an online-offline consistency verification technique was built to ensure the correctness of the online deployment.
Stream Sample Buffer Pool
Real-time logs such as exposures, clicks and transactions were used to generate samples. The delay of real-time logs would cause a large number of negative samples to be received and trained on before the corresponding positive samples suddenly arrived. A sample buffer pool in the real-time sample generation step was introduced to address this problem.
Different from typical offline sample pre-processing, the stream sample buffer pool requires the combination of an online bi-directional circular array and a Last In, First Out (LIFO) queue. Four basic operations of the pool (i.e. Pop, Time-up, Enqueue and Sample) are defined to perform high-level transformations such as sample buffering, sampling and balancing, as illustrated in Figure 5.

When a sample was generated, an enqueue task buffered it into a different sample buffer pool according to its label. When a sample became too old or the queue size exceeded a maximum limit, a dequeue task discarded the oldest sample. When a sample request came from the training phase, a dequeue task popped the newest negative samples and a repeated sampling task sampled from the existing positive samples in the pool. By adjusting the positive repeated-sampling rate and the negative queue size, the use of the newest negative samples and a balance between positive and negative samples were achieved. When an abnormal ratio of positive to negative samples was detected, the training process was put on hold until the sample pool was refilled with a proper proportion of samples.

Figure 5: The stream sample buffer pool.
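The sketch below illustrates the intended behavior of the buffer pool (newest negatives popped first, oldest dropped, positives repeatedly sampled, training held when the pool is unbalanced); it is a simplification with assumed parameter names, not the production implementation:

# Simplified stream sample buffer pool.
import random
from collections import deque

class StreamSampleBufferPool:
    def __init__(self, neg_queue_size=10000, pos_pool_size=2000, pos_resample_rate=4):
        self.negatives = deque(maxlen=neg_queue_size)   # oldest dropped automatically
        self.positives = deque(maxlen=pos_pool_size)
        self.pos_resample_rate = pos_resample_rate

    def enqueue(self, sample, label):
        (self.positives if label == 1 else self.negatives).append(sample)

    def next_batch(self, batch_size):
        # Hold training if the pool is too unbalanced or too empty.
        if not self.positives or len(self.negatives) < batch_size:
            return None
        negs = [self.negatives.pop() for _ in range(batch_size)]                # newest negatives
        poss = random.choices(list(self.positives), k=self.pos_resample_rate)   # repeated sampling
        return negs, poss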
Online and Offline Consistency
In order to guarantee the correctness of the deployment, online and offline consistency was verified mainly in two respects:

AUC consistency: to verify that the online data distribution is the same as the training distribution.

Scoring consistency: to verify that the online scoring values are the same as the training scoring values.
We managed to achieve an online-offline absolute AUC difference within 0.01. For example, the AUC difference between online and offline on 22nd August 2020 under the Just For You scenario of the AliExpress platform was 0.005 (offline AUC: 0.690, online AUC: 0.695). The percentage of wrongly sorted pairs due to scoring differences was kept lower than 3%. To measure it, a subset of online samples was collected and scored again by the offline networks after the real-time training procedure had been intentionally turned off for several hours; the difference between the online and offline scores was then quantified by counting the wrongly sorted pairs. For instance, the percentage of wrongly sorted pairs was 0.45% on 25th February 2020, as there were 18 wrongly sorted pairs among 4,030 pairs.
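The scoring-consistency check can be illustrated with a small helper that counts wrongly sorted pairs between the online and offline scores (an assumption of how the statistic is computed; e.g. 18 of 4,030 pairs gives roughly 0.45%):

# Count pairs whose relative order differs between online and offline scores.
from itertools import combinations

def wrong_sorting_pair_rate(online_scores, offline_scores):
    pairs = list(combinations(range(len(online_scores)), 2))
    wrong = sum(1 for i, j in pairs
                if (online_scores[i] - online_scores[j]) * (offline_scores[i] - offline_scores[j]) < 0)
    return wrong / len(pairs)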
Moreover, the real-time online AUC feedback is calculated for important feature inputs so as to monitor the online status. The real-time AUC values for selected features between 09th and 15th September 2020 are shown in Figure 6, where ltr denotes the online score output of the LTR model, which was kept around 0.690 during these days, and ctr and cvr are the CTR and CVR model outputs, respectively. It can be observed that the click-AUC of the cvr score was rather small, as cvr is the model output for pay events and is less relevant to click events, indicating an obvious conflict between the CTR and CVR models.

Figure 6: Online click-AUC for important features and the LTR model outputs.
Maintenance
Manual setting of the objective targets has been required since the DC-LTR AI Engine was deployed as a core service for the AliExpress and Lazada platforms, especially during promotional campaigns. Automatic alarms are designed to go off when the real-time logs have been stuck for too long, in which case a manual diagnosis is required to recover the real-time logs. Apart from this, no other major maintenance task for the DC-LTR AI Engine has been required since its deployment on AliExpress and Lazada in September 2019. We will continue to review the system one year after its deployment.
Application Use and Payoff
Improvement in Convergence
An evaluation of loss decoupling was made to show the influence of the loss bias technique. We applied the loss decoupling technique to a typical click-loss and pay-loss optimization problem with the 3-hidden-layer scoring networks. The sample batch size was set to 256 and the learning rate to 0.001, with categorical sparse embedding, Batch Normalization and Leaky ReLU activation engaged. Compared to the previously deployed LTR model without decoupling, the decoupled model (i.e. DC-LTR), trained with the same samples and hyper-parameter settings, achieved a reduction in click-loss from 0.22 to 0.16 and a reduction in pay-loss from 0.005 to 0.0005 (Table 1). This showed that the loss-bias decoupling was beneficial for both the click-loss and the pay-loss, ensuring that they were optimized around their respective distribution spaces.
                    Coupled    Decoupled    Change
Click-loss          0.22       0.16         ↓
Pay-loss            0.005      0.0005       ↓
Convergence Steps   500k       50k          ↓
Click-AUC           0.695      0.697        ↑
Pay-AUC             0.744      0.755        ↑

Table 1: Convergence comparison between the coupled baseline and the decoupled DC-LTR model.
As a result, the decoupled model converged at around 50,000 steps with the click-AUC stabilizing around 0.695. In contrast, the base model without decoupling required around 500,000 steps to converge.
Through decoupling, the influence between the different losses was reduced under DC-LTR. This also made separate tuning of each loss feasible.
Improvement in Adaptability
The DC-LTR model was incorporated into the AliExpress and Lazada platforms with the following losses: click-loss, cart-loss (add-to-cart loss), wish-loss (add-to-wish-list loss), pay-loss, price-loss (discounted-price square loss) and GMV-loss (weighted GMV square loss). The controller parameters were set as listed in Table 2.
Figure 7 shows a segment of the captured system process at 00:30 on 22nd August 2020. An unknown disturbance led to a decline in the online pay-AUC below the target value of 0.78. The drop was detected by the DC-LTR feedback controller, which raised $u_{pay}$ and $w_{pay}$ automatically. After two hours of continuous training, the online pay-AUC was increased back above the target level at 02:30, so the controller outputs started to decline slowly. The adjustment process finally ceased at 05:30. Specifically, we set the click-AUC target to 0.71, which was impossible to reach in practice. In this way, the controller would always attempt to optimize the click-loss while keeping the online pay-AUC above the target level.
Figure 8 shows a segment of the captured system process at 04:30 on 3rd September 2020. An unknown disturbance led to a decline in the online wish-AUC below its target value of 0.70. The drop was detected by the feedback controller of DC-LTR, which triggered an increase in $u_{wish}$ and $w_{wish}$. After 6 hours of continuous training, the online wish-AUC was raised back above its target level at 10:30.
             u_0    r(t)    K_p    K_i    K_B
Click-loss   1.0    0.710   1.0    0.0    2.0
Wish-loss    0.5    0.70    1.0    0.01   5.0
Cart-loss    0.5    0.70    1.0    0.01   5.0
Pay-loss     50.0   0.78    1.0    0.2    50.0
Price-loss   0.0    0.0     1.0    0.0    1e-3
GMV-loss     0.0    0.0     1.0    0.0    1e-3

Table 2: Parameters for the feedback controllers in each channel. r(t) is the controller target. K_p and K_i are the parameters of the proportional and integral components. K_B is the saturation bound for the integral component. u_0 is the basic control output of the controller.
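For illustration, the Table 2 settings map directly onto one controller per loss channel, reusing the SaturatedPIController sketch introduced earlier (an illustrative reading, not the production configuration code):

# One saturated PI controller per loss channel, parameterized as in Table 2.
controllers = {
    "click": SaturatedPIController(target=0.710, k_p=1.0, k_i=0.0,  k_b=2.0,  u_0=1.0),
    "wish":  SaturatedPIController(target=0.70,  k_p=1.0, k_i=0.01, k_b=5.0,  u_0=0.5),
    "cart":  SaturatedPIController(target=0.70,  k_p=1.0, k_i=0.01, k_b=5.0,  u_0=0.5),
    "pay":   SaturatedPIController(target=0.78,  k_p=1.0, k_i=0.2,  k_b=50.0, u_0=50.0),
    "price": SaturatedPIController(target=0.0,   k_p=1.0, k_i=0.0,  k_b=1e-3, u_0=0.0),
    "gmv":   SaturatedPIController(target=0.0,   k_p=1.0, k_i=0.0,  k_b=1e-3, u_0=0.0),
}
# Each training interval: u = controllers["pay"].update(current_pay_auc), then
# u is mapped linearly to the loss weight w_pay used in Eq. (10).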
Figure 7: DC-LTR adjusts the model automatically when the pay-AUC drops below the target level. Note that the click-AUC target of 0.71 was intentionally set to be unreachable so that DC-LTR continuously maximizes clicks.
Figure 8: DC-LTR adjusts control outputs automatically to
keep the wish-AUC from dropping.
Then the controller output declined, and the adjustment process eventually ceased at 15:30. With DC-LTR, the online wish-AUC was kept above the target level while the click-loss was being minimized throughout the adjustment process.
Deployment Performance
The online performance achieved by DC-LTR in AliExpress and Lazada is reported in Figure 9. The first experiment, under the Just For You scenario of AliExpress (AE JFY) during March, was an A/B test studying the adaptability of DC-LTR with the objectives set to maximizing clicks while keeping the GMV high. Prior to the activation of DC-LTR, analysis of this scenario showed that the ranking score was negatively correlated with item prices; the CoRR between the rank score and the item price was around -0.15.
In other words, items with lower prices were more likely to be ranked in the top positions, resulting in a very low average spending by customers. With the help of the square loss in DC-LTR, this correlation was improved to nearly zero (a magnitude of about 0.01). As a result, the number of clicks and the GMV were increased by 3.11% and 2.70%, respectively.
Another A/B test was conducted in the Detail Page scenario of AliExpress (AE Detail) during March, with the objective of maximizing GMV while keeping the average spending by customers high. Before the deployment of DC-LTR, the strategy for generating the final rank score was a simple manual dot-product combination of CTR and CVR. Compared to this manual combination strategy, DC-LTR improved the number of clicks and the GMV by 6.00% and 5.35%, respectively. Moreover, the conflict between CVR and CTR was resolved in this case: CTR, CVR and the average spending by customers were simultaneously improved by DC-LTR. The online training and real-time deployment framework powered by DC-LTR has become a core service for both AliExpress and Lazada. It is the only process that responds to changing business requirements directly and changes the optimization objectives in real time.
Prior to the Singles' Day Shopping Carnival of 2019 (i.e. 1st November to 10th November 2019), the business objective was adjusted to maximizing the add-to-cart rate: customers add their favorite items to their carts and wait for the discounts during the 24-hour period of 11th November 2019 to commence their purchases. In the early morning of 11th November 2019, the business objectives were switched to maximizing transaction volume in anticipation of the start of the shopping carnival. Online training and deployment faced enormous challenges as visitor traffic rose to 3 to 10 times that of normal days; thus, down-sampling was performed to maintain a steady real-time training process. After a comparison of the online performance between 2019 and 2018 was made at noon, average customer spending became the maximization objective in AliExpress, while the transaction maximization objective still held in Lazada. The adjustments to training were made automatically by DC-LTR in real time after the objectives were updated, and the new model took effect around 5 minutes later. After the campaign, the optimization objectives were immediately switched by DC-LTR to maximizing clicks, with the constraint that GMV should be kept at the level of normal days. Under peak load during the 2019 Singles' Day Shopping Carnival, DC-LTR helped increase the GMV of both AliExpress and Lazada by 11%.
Figure 9: Online performance for the DC-LTR system under
different scenarios and phases in AliExpress and Lazada.
Lessons Learned During Deployment
It is worth mentioning that, even with DC-LTR automatically adapting among different business objectives and constraints, there have still been some situations that the AI Engine could not handle. Several important lessons have been learned from our deployment experience.

Firstly, proper targets should be set to avoid unintended final states. The optimal final state cannot be guaranteed when two or more targets are unreachable. One possible solution is to ensure that there is only one unreachable target, so that this target is maximized under the constraints of the other, reachable objectives.
Secondly, tuning each channel separately using only that channel's feedback error is a limitation: DC-LTR currently does not make the best use of the feedback errors from the other channels. To improve in this respect, more powerful controllers, such as Generalized Predictive Control (Clarke, Mohtadi, and Tuffs 1987) and Dynamic Matrix Control (Haeri and Beik 2003), will be explored.
Conclusions and Future Work
In this paper, we reported on our experience deploying a general online training and real-time deployment framework for ranking models in AliExpress and Lazada to automatically adapt among rapidly changing business objectives. An algorithmic framework, the online Deep Controllable Learning-To-Rank (DC-LTR), was proposed to automatically optimize constrained business objectives and reduce the conflicts between different objectives. To the best of our knowledge, this is the only deployed attempt that has managed to control the final state of optimization to meet business requirements.

In subsequent research, we will focus on enhancing DC-LTR in the following aspects:
1. Autonomous goal adjustment: the current DC-LTR still requires system administrators to manually set the goals. In the future, we will explore bandit methods to enable DC-LTR to learn to automate goal adjustment.
2. Delayed data distribution: online training often lags behind the actual data distribution. To address this problem, we will explore how to leverage prior knowledge to adjust the data distribution.
Acknowledgments
This research is supported, in part, by Alibaba Group through the Alibaba Innovative Research (AIR) Program and the Alibaba-NTU Singapore Joint Research Institute (JRI) (Alibaba-NTU-AIR2019B1), Nanyang Technological University, Singapore; the National Research Foundation, Singapore, under its AI Singapore Programme (AISG Award No: AISG-GC-2019-003); the NRF Investigatorship Programme (NRFI Award No: NRF-NRFI05-2019-0002); the Nanyang Assistant Professorship (NAP); and the RIE 2020 Advanced Manufacturing and Engineering (AME) Programmatic Fund (No. A20G8b0102), Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
References
Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, 265-283.
Åström, K. J.; and Hägglund, T. 1995. PID Controllers: Theory, Design, and Tuning, volume 2. Instrument Society of America, Research Triangle Park, NC.
Benesty, J.; Chen, J.; Huang, Y.; and Cohen, I. 2009. Pearson correlation coefficient. In Noise Reduction in Speech Processing, 1-4. Springer.
Chen, Z.; Badrinarayanan, V.; Lee, C.-Y.; and Rabinovich, A. 2018. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 794-803.
Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. 2016. Wide & deep learning for recommender systems. In RecSys, 7-10.
Clarke, D. W.; Mohtadi, C.; and Tuffs, P. 1987. Generalized predictive control - Part I. The basic algorithm. Automatica 23(2): 137-148.
Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8): 861-874.
Haeri, M.; and Beik, H. Z. 2003. Extension of nonlinear DMC for MIMO systems. In ICCA, 375-379.
Hu, Y.; Da, Q.; Zeng, A.; Yu, Y.; and Xu, Y. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In KDD, 368-377.
Joachims, T.; Li, H.; Liu, T.-Y.; and Zhai, C. 2007. Learning to rank for information retrieval (LR4IR 2007). In SIGIR, 58-62.
Kunaver, M.; and Pozrl, T. 2017. Diversity in recommender systems - A survey. Knowledge-Based Systems 123: 154-162.
Lin, X.; Chen, H.; Pei, C.; Sun, F.; Xiao, X.; Sun, H.; Zhang, Y.; Ou, W.; and Jiang, P. 2019. A Pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In RecSys, 20-28.
Ma, X.; Zhao, L.; Huang, G.; Wang, Z.; Hu, Z.; Zhu, X.; and Gai, K. 2018. Entire space multi-task model: An effective approach for estimating post-click conversion rate. In SIGIR, 1137-1140.
Milojkovic, N.; Antognini, D.; Bergamin, G.; Faltings, B.; and Musat, C. 2019. Multi-gradient descent for multi-objective recommender systems. arXiv preprint arXiv:2001.00846.
Ning, X.; and Karypis, G. 2010. Multi-task learning for recommender system. In ACML, 269-284.
O'Dwyer, A. 2009. Handbook of PI and PID Controller Tuning Rules. Imperial College Press.
Pan, L.; Meng, X.; Shen, Z.; and Yu, H. 2009. A reputation pattern for service oriented computing. In ICICS, 1-5.
Ribeiro, M. T.; Ziviani, N.; Moura, E. S. D.; Hata, I.; Lacerda, A. M.; and Veloso, A. A. 2014. Multi-objective Pareto-efficient algorithms for recommender systems. ACM Transactions on Intelligent Systems and Technology 5(4).
Shen, Z.; Yu, H.; Miao, C.; and Weng, J. 2011. Trust-based web service selection in virtual communities. Web Intelligence and Agent Systems 9(3): 227-238.
Wen, H.; Zhang, J.; Wang, Y.; Lv, F.; Bao, W.; Lin, Q.; and Yang, K. 2020. Entire space multi-task modeling via post-click behavior decomposition for conversion rate prediction. In SIGIR, 2377-2386.
Yu, H.; Cai, Y.; Shen, Z.; Tao, X.; and Miao, C. 2010. Agents as intelligent user interfaces for the net generation. In IUI, 429-430.
Zeng, A.; Yu, H.; Da, Q.; Zhan, Y.; and Miao, C. 2020. Accelerating ranking in e-commerce search engines through contextual factor selection. In IAAI, 13212-13219.
Zhou, G.; Zhu, X.; Song, C.; Fan, Y.; Zhu, H.; Ma, X.; Yan, Y.; Jin, J.; Li, H.; and Gai, K. 2018. Deep interest network for click-through rate prediction. In KDD, 1059-1068.