Robust scenario approach for optimizing uncertainty

keyword: robust optimization, support division, multimodal distribution

contribution:
– suggests support division (SD) approach which addresses drawbacks of stochastic problem by transforming random variable from value to its range index.
– suggests methodologies for setting range for divided support both for small and large data regime.
– discusses approach to multimodal distribution and analyze the result of the proposed model on multimodal demands which are known to be diﬃcult to reconcile with majority of newsvendor models

literature review

1. data: bimodal

why bimodal?
First from methodology point of view, there are many known problems related to multimodal distribution. For example, the quality of approximation is bad for embedded Laplace approximation or mean field variational inference. Moreover, in distribution robust optimization, ambiguity sets disregarding the multimodality will produce overly conservative results (Hanasusanto15) and for simulation-based methods including Monte Carlo the estimation error of optimal solution based on simulation could great when the random variable’s density has multimodal distribution (Ekin20). Some reasons behind these frequent failure is due to the existence of local optimum in multimodal distribution. However, in application point of view, many multimodal distributions are observed owing to different latent variables that generates data such as different types of customer, topics, or trends (Hanasusanto15). Therefore, research on optimizing bimodal uncertainty, as a simplified example of multimodal, could be fruitful in both aspects.

2. Robust optimization
Robust optimization attempts to design a solution which are immune to data uncertainty (Bertsimas and Sim04). Depending on the existence of random variable in the model, it could be classiﬁed into non-probabilistic or probabilistic robust optimization. Wald’s model (Wald45) is a dominating paradigm for non-probabilistic model. As in equation 1, solutions are ranked on the basis of their worst-case costs and one with the least worst outcome is the optimal. Conservatism of its solutions has been noted (Bertsimas and Sim04). For example, distribution free newsvendor model (Gallego and Moon93) uses maximin approach to optimize order quantity based on limited information of uncertain demand: mean and standard deviation. Though this model could provide robust solutions for diverse distribution, its worst case is two point mass, which is unlikely to be realized. Probabilistic robust optimization, or stochastic optimization, is a numerical optimization problem that arises from observing data from random data-generating process (Duchi18). It includes random variables which leads to random objective functions or random constraints. Shapiro (12) explains minimax stochastic programming as a compromise between probabilistic and nonprobabilistic optimization. The former has been criticized for its diﬃculty to specify the distribution of unknown parameters whereas the latter for its conservative outcome.

Distributionally Robust Optimization (DRO) addresses these issues. Based on the available empirical data, partial distribution information such as support set or moment statistics are extracted to construct uncertainty set, D. DRO optimize the worst-case expected performance on a set constituted by an inﬁnite number of distributions contained in “uncertainty set”. DRO is less conservative than RO as RO tries to protect against parameter’s worst-case “realization” whereas DRO prevents its worst-case “distribution“. On the other hand, DRO is easier to formulate than SP as only the support of uncertain parameters are requested in advance compared to their exact distribution. Other concepts such as scenario-based SP (Chen02) exist that fills the gap of the four mentioned optimization.

3. Application

3.1 Newsvendor

Newsvendor Problem – The Tale of the First Formula in the Textbook | Stitch Fix Technology – Multithreaded — mean demand is not the optimal order quantity!

Newsvendor model suggests an optimal one-time ordering inventory amount that minimizes the expected cost.

It is widely used in business settings (Cachon and Terwiesch, 2008). If the order is larger than the actual demand, overage cost occurs, and if it is smaller, underage cost due to opportunity cost of sales occurs. Therefore, it is important to have adequate inventory. In the past, there have been studies on inventory optimization algorithms assuming the distribution of demand. Cachon and Terwiesch (2008) argued that if demand follows a normal distribution, the optimal value can be obtained in a closed form. Though normal distribution are frequently observed, the optimal solution that assumes the demand distribution could be inaccurate and not robust. This is especially true if the amount of demand is low. For example, Daskin et al. (2002) argued that demand follows a Poisson process. A demand prediction model that lowers the dimension of the distribution was presented by using the Lagrangian relaxation optimization algorithm that divides the demand distribution into multiple distributions.
If price and demand information are known, optimal stock level can be derived through newsvendor model. Benzion et al. (2010) argued that since it is impossible to know exactly the demand information, it is meaningless to assume distribution in the newsvendor model. Since real demand includes uncertainty, it is not desirable to ﬁt the demand distribution into a speciﬁc distribution. There have been studies to derive the optimum value without ﬁtting the demand distribution to a speciﬁc distribution. Gallego Moon (1993) proposed a distribution free model when information on demand distribution is limited to mean and variance only. This model assumes follows the minimax decision-making method that selects the highest value among the lowest proﬁts. The free distribution model has robustness in the form of demand distribution because it derives an optimal value by considering all possible demands using limited information. Considering all possible demands means that even the number of cases with small probability is considered, and for this reason, there is a limitation that the free distribution model may present values diﬀerent from the actual values when deriving the maximum proﬁt. Too conservative optimal solution has been criticized.

In this study, the distribution free model was selected as the comparative model together with the normal distribution model. This is to compare the robustness of the segmentation model when a distribution containing uncertainty occurs, contrary to the case of a stable normal distribution. Perakis Roels (2008) hypothesized a situation in which various information such as the mean and variance of the distribution are given. The decision-making of the model follows the criterion of minimizing maximum regret. In addition, Aryanezhad et al. (2012) paid attention to the diﬀerence between the order volume and the actual delivery volume caused by uncertainty in demand. The diﬀerence generated by this diﬀerence ﬂowing through the supply chain was derived using a nonlinear regression equation, and the optimal value was calculated by a systematic approach that considered the entire supply chain. Past newsvendor models were ﬁtted by assuming the distribution or did not assume the distribution. We tried to derive the optimal value using the information of the mean and variance. As the uncertainty of reality is included in the distribution, the actual distribution becomes complex. Therefore, the method that is suitable for assuming a speciﬁc distribution is limited in practical application. Models that do not assume a distribution can be said to be robust in that they derive an optimal value even if uncertainty is included, but there is a limit that satisfaction with the derived optimal value is low.

2. Support division (SD) model

When relatively accurate range forecast is possible, we could specify the uncertainty by transforming a continuous random variable d into discrete random variable m. m is an index that is set by learning the quantiles of historical data. This has two benefits. First, it is easier to average over m than d. Probability estimation becomes much more intuitive and easier when its object is range instead of exact value. Second, if we view the problem as two stage, where q is set in the first stage that minimizes the average of the second stage variable $C(q|d)$ from each d’s scenario, the transformation has the effect of reducing the number of scenario. For simplicity, 5 quantiles are chosen but any numbers and forms of range is possible as long as we know $p(m=1),...,p(m=M)$ . By defining the cost for each given range as a maximum cost over possible demand in the range i.e. $C(q|m) := max_{d \in d(m)} C(q|d)$ , the following are equivalent.
(1) $min_{q} E_{d} [C(q|d)]$
(2) $min_{q} E_{m}[C(q|m)]$
(3) $min_{q} E_{m}[max_{d \in d(m)} C(q|d)]$
(4) $min_{q} E_{m}[t_{m}]$ $\: s.t. C(q|d_{m}) \leq t_{m} \: \forall d_{m} \in d(m), \forall m = 1,...,M$
d: uncertainty random variable which follows distribution $D$
m: uncertainty index random variable that denotes which range of support a certain d belongs to ie. scenario
d(m): mth range of support division

Compare the following original minimax objective to equation (3). Averaging over the specified uncertainty has an effect of curtailing conservatism cost.
$min_{q} max_{d \in \cup_{m} d(m)} C(q|d)$

Degree of conservatism

Γ in constraints controls the tradeoﬀ between robustness and the level of conservativeness of the solution for each scenario by restricting the maximum number of the worst-case happening simultaneously.
(5)
$min_{q} E_{m}[\frac {L_{m}+H_{m}}{2} * (1-x_{m}) + H_{m}* x_{m}]$
$C(q|d_{m}) \leq H_{m} \\ C(q|d_{m}) \geq L_{m} \\ \sum_{m}x_{m} \leq \Gamma \\ \forall d_{m} \in d(m), x \in B^{M}\: \forall m = 1,...,M$

Range setting

In most cases, inverse of cumulative distribution of historical data could be used to get the quartile needed for the setting its range . If the problem in interest has a structure whereFor our model, the number and form of range is flexible, and it is recommended that the modeler should carefully choose the range sets where accurate mapping between the value and its range is possible. For example, it is easier demarcate ranges when the quantity of interest $d$ follows bimodal distribution compared to uniform. Mixture models such as Gaussian mixture model could be used in advance to learn the structure of the data. There has been many research that maps each data point to classes especially for multimodal distribution. Considering the cause of the multimodal distribution mentioned in Section 2, it is natural that maximum worst cost should be calculated on separate sets before being aggregated.

An online model where the updated result of mixture model, probability of each data belong to each class (range for our case), is delivered to the subsequent optimization (equation 2) to update the maximum cost of each range $max_{d \in d(m)} C(q|d)$ and the probability measure for the expectation $E_{m}$ could be possible.

Compared to other distributions, multimodal is more likely to suffer from lack of data problem as it has great number of parameters. For example, five parameters are needed to represent the most basic bimodal distribution. When data is not enough to construct distribution, prior or substitution likelihood could be applied. Dunson18 suggests quantile inference based on the following posterior which is that combines informative prior and intuitive substitution likelihood and which showed great performance in various distribution (Fig) including bimodal. Various priors were suggested including general proper, bounded, truncated normal prior which could be applied depending on our knowledge of the distribution such as its support and quantile. The role and importance of prior have been reported (Gelman05; Gelman17; piironen17) and our prior knowledge on the population especially regarding to its modality-causing components would be helpful.

$\pi(\boldsymbol{\theta} \mid \mathbf{y}) \propto s(\boldsymbol{\theta}) \pi(\boldsymbol{\theta})$

prior: proper, bounded, and truncated normal for quantile
$\pi(\boldsymbol{\theta}) \propto 1(\boldsymbol{\theta} \in \Omega) 1\left(\theta_{1} \geq c_{L}\right) 1\left(\theta_{m} \leq c_{U}\right)$
$\pi(\boldsymbol{\theta}) \propto 1(\boldsymbol{\theta} \in \Omega) g\left(\theta_{1}, \theta_{m}\right) \; \int_{-\infty}^{\infty} \int_{\theta_{1}}^{\infty} g\left(\theta_{1}, \theta_{m}\right)\left(\theta_{m}-\theta_{1}\right)^{m-2} d \theta_{m} d \theta_{1}<\infty$
$\pi(\boldsymbol{\theta}) = N_{m}(\boldsymbol{\theta}_{0}, \boldsymbol{\Sigma}_{0})$ s.t. $\boldsymbol{\theta} \in \Omega$

substitution likelihood:
$s(\boldsymbol{\theta})= {n \choose {\ u_{1}(\boldsymbol{\theta}) \cdots u_{m+1}(\boldsymbol{\theta})}} \prod_{l=1}^{m+1} \Delta p_{l}^{u_{l}(\boldsymbol{\theta})}$
$\theta_{0}=-\infty, \theta_{m+1}=\infty, u_{l}(\boldsymbol{\theta})=\sum_{i=1}^{n} 1_{\left(\theta_{l-1}, \theta_{l}\right]}\left(y_{i}\right) \Delta p=\left(p_{1}, p_{2}-p_{1}, \ldots, 1-p_{m}\right)^{\prime}$

3. Experiments

Different distribution

Bimodal distribution

SD showed great performance when the means and weights of two modes had great differences.

Conclusion

contribution

suggest support division approach which addresses drawbacks of stochastic problem through transforming random variable from value to its range index.
suggest methodologies for range setting in small and large data regime.
discuss approach to multimodal distribution and analyze the result of the proposed model on multimodal demands which are known to be diﬃcult to reconcile with majority of newsvendor models

limitation & future

prior design especially for bimodal: truncated normal prior for order-constrained bayesian (Gelfand92). Prior could greatly affect the performance of the model if properly understood in the context of the entire Bayesian analysis, from inference to prediction to model evaluation (Gelman17).
application of hierarchical (or multilevel) model: It is more flexible than discrete or continuous mixture model when addressing heterogeneity and over-dispersion (Mcelreath20, Gelman20). It would be interesting to design a hierachical prior and a likelihood where shared parameter could partial pool the information between subgroups.

Categories

literature review

2. Support division (SD) model

Range setting

3. Experiments

Conclusion

Comment is the energy for a writer, thanks!Cancel reply