Google Hashcode 2022

practice_problem.ipynb Open In Colab Kaggle Gradient Open In SageMaker Studio Lab Hits

Description: Google Hashcode 2022 Practice Problem

Tags: amplpy, heuristics, engineering, scheduling, complexity

Notebook author: Marcos Dominguez Velad <marcos@ampl.com>

Model author: Marcos Dominguez Velad <marcos@ampl.com>

References: N/A

Google Hashcode is a team programming competition to solve a complex engineering problem.

In this notebook we are showing how Mathematical Optimization methods as Mixed Integer Programming (MIP) are useful to solve this kind of problems, as they are really easy to implement and give optimal solutions (not only trade-off ones), as opposed to greedy approaches or heuristics. We are solving the pizza warm-up exercise.

We are using AMPL as the modeling language to formulate the problem from two different approaches (not all the formulations are the same in terms of complexity), coming up with enhancements or alternative approaches is an important part of the solving process.

As an instructive example of how to face this kind of problems, we are using the AMPL API for Python (AMPLPY), so we can read the input of the problem, translate easily to data for AMPL, and retrieve the solution to get the score. Because of using MIP approach, the score will be the highest possible for the problem.

Problem statement

The statement of this year is related to a pizzeria, the goal is to maximize the number of customers coming, and we want to pick the ingredients for the only pizza that is going to be sold:

  • Each customer has a list of ingredients he loves, and a list of those he does not like.

  • A customer will come to the pizzeria if the pizza has all the ingredients he likes, and does not have any disgusting ingredient for him.

Task: choose the exact ingredients the pizza should have so it maximizes the number of customers given their lists of preferences. The score is the number of customers coming to eat the pizza.

(The statement can be found here)

First formulation

The first MIP formulation will be straightforward. We have to define the variables we are going to use, and then the objective function and constraints will be easy to figure out.

Variables

We have to decide which ingredients to pick, so

  • $x_i$ = 1 if the ingredient i is in the pizza, 0 otherwise.

  • $y_j$ = 1 if the customer will come to the pizzeria, 0 otherwise.

Where $i = 1, .., I$ and $j = 1, .., c$ (c = total of customers and I = total of ingredients).

Objective function

The goal is to maximize the number of customers, so this is clear: $$maximize \ \sum \limits_{j = 1}^c y_j$$

Finally, we need to tie the variables to have the meaning we need by using constraints.

Constraints

If the j customer comes, his favourite ingredients should be picked (mathematically $y_j=1$ implies all the $x_i = 1$). So, for each $j = 1, .., c$:

$$|Likes_j| \cdot y_j \leq \sum \limits_{i \in Likes_j} x_i$$

Where $Likes_j$ is the set of ingredients $j$ customer likes, and $|Likes_j|$ the number of elements of the set.

If any of the disliked ingredients is in the pizza, customer $j$ can’t come (any $x_i = 1$ implies $y_j = 0$). For each customer $j = 1, .., c$:

$$\sum \limits_{i \in Dislikes_j} x_i \leq \frac{1}{2}+(|Dislikes_j|+\frac{1}{2})\cdot(1-y_j)$$

So when customer $j$ comes, the right side is equal to $$\frac{1}{2}+(|Dislikes_j|+\frac{1}{2})\cdot(1-1) = \frac{1}{2} + 0 = \frac{1}{2}$$ This implies the left side to be zero, because the $x_i$ variables are binary. If the customer $j$ does not come, the inequality is satisfied trivially.

We will need the input data files from the problem, they are available in the amplpy Github repository:

import os

if not os.path.isdir("input_data"):
    os.system("git clone https://github.com/ampl/colab.ampl.com.git")
    os.chdir("colab.ampl.com/authors/marcos-dv/hashcode")

if not os.path.isdir("ampl_input"):
    os.mkdir("ampl_input")

Let’s use AMPL to formulate the previous problem. The following section setup AMPL to run in also in the cloud (not only locally) with Google Colab.

AMPLPY Setup in the cloud

Here is some documentation and examples of the API: Documentation, GitHub Repository, PyPI Repository, other Jupyter Notebooks. The following cell is enough to install it. We are using ampl (modeling language) and COIN (contains aasdasdasdasdasdasdasdasdCBC open-source solver) modules.

# Install dependencies
%pip install -q amplpy
# Google Colab & Kaggle integration
from amplpy import AMPL, ampl_notebook

ampl = ampl_notebook(
    modules=["coin"],  # modules to install
    license_uuid="default",  # license to use
)  # instantiate AMPL object and register magics

Solving problem with AMPL

First, we need to write a model containing the mathematical formulation. After that, we will add the data to solve the different instances of the Hashcode problem.

%%ampl_eval

# PARAMETERS AND SETS
param total_customers;

# Set of ingredients
set INGR;
# Customers lists of preferences
set Likes{1..total_customers};
set Dislikes{1..total_customers};

# VARIABLES

# Take or not to take the ingredient
var x{i in INGR}, binary;
# customer comes OR NOT
var y{j in 1..total_customers}, binary;

# OBJECTIVE FUNCTION
maximize Total_Customers: sum{j in 1..total_customers} y[j];

s.t.
Customer_Likes{j in 1..total_customers}:
	card(Likes[j])*y[j] <= sum{i in Likes[j]} x[i];

param eps := 0.5;

Customer_Dislikes{j in 1..total_customers}:
	sum{i in Dislikes[j]} x[i] <= 1-eps+(card(Dislikes[j])+eps)*(1-y[j]);

Translate input with Python

The input files are in the folder input_data/, but they do not have the AMPL data format. Fortunately, we can easily parse the original input files to generate AMPL data.

import sys

# dict to map chars to hashcode original filenames
filename = {
    "a": "input_data/a_an_example.in.txt",
    "b": "input_data/b_basic.in.txt",
    "c": "input_data/c_coarse.in.txt",
    "d": "input_data/d_difficult.in.txt",
    "e": "input_data/e_elaborate.in.txt",
}


def read(testcase):
    with open(filename[testcase]) as input_file, open(
        "ampl_input/pizza_" + testcase + ".dat", "w+"
    ) as output_data_file:
        # total_customers
        total_customers = int(input_file.readline())
        ampl.param["total_customers"] = total_customers

        # loop over customers
        ingr = set()
        for c in range(1, total_customers + 1):
            likes = input_file.readline().split()
            ampl.set["Likes"][c] = likes[1:]
            dislikes = input_file.readline().split()
            ampl.set["Dislikes"][c] = dislikes[1:]
            ingr = ingr.union(set(likes))
            ingr = ingr.union(set(dislikes))
        ampl.set["INGR"] = ingr


# Let's try with problem 'c' from hashcode
read("c")

Now, solve the problem usign AMPL and CBC (mip solver)

%%ampl_eval
option solver cbc;
solve;
display x, y;
CBC 2.10.5: CBC 2.10.5 optimal, objective -5
0 nodes, 3 iterations, 0.00673 seconds
:       x   y    :=
1       .   0
2       .   0
3       .   1
4       .   1
5       .   1
6       .   0
7       .   1
8       .   1
9       .   0
10      .   0
'0'     0   .
'1'     0   .
'3'     0   .
akuof   1   .
byyii   1   .
dlust   1   .
luncl   1   .
qzfyo   0   .
sunhp   0   .
tfeej   1   .
vxglq   1   .
xdozp   1   .
xveqd   1   .
;

So the ingredients we should pick are:

  • byyii, dlust, luncl, tfeej, vxglq, xdozp and xveqd.

  • Customers coming are: 4, 5, 7, 8, 10. Total score: 5.

We can write an output file in the hashcode format:

%%ampl_eval
printf "%d ", sum{i in INGR} x[i] > output_file.out;
for{i in INGR}{
    if x[i] = 1 then printf "%s ", i >> output_file.out;
}
shell 'cat output_file.out';
8 luncl dlust xveqd byyii tfeej xdozp vxglq akuof 

You can try this with the other practice instances!

The big ones can take several hours to get the optimal solution, as MIP problems are usually hard because of the integrity constraints of the variables. That’s why it is often necessary to reformulate the problem, or try to improve an existing formulation by adding of combining constraints / variables. In the following section, we present an alternative point of view to attack the Hashcode practice problem, hoping the solver finds a solution earlier this way.

Alternative formulation

We could exploit the relations between customers and see if we can figure out of them. Actually, the goal is to get the biggest set of independent customers that are compatible (so none of their favourite ingredients are in the pizza). The ingredients we are picking may be deduced from the particular customers preferences we want to have.

With this idea, let’s propose a graph approach where each customer is represented by node, and two nodes are connected by an edge if and only if the two customers are compatible. This is translated to the problem as:

  • Customer i loved ingredients are not in the disliked ingredients list of j (and vice versa).

With sets, this is:

$$Liked_i \cap Disliked_j = Liked_j \cap Disliked_i = \emptyset $$

So the problem is reduced to find the maximal clique in the graph (a clique is a subset of nodes and edges such as every pair of nodes are connected by an edge), which is an NP-Complete problem. The clique is maximal respect to the number of nodes.

New variables

To solve the clique problem we may use the binary variables:

  • $x_i$ = 1 if the node belongs to the maximal clique, 0 otherwise. For each $i = 1, .., c$.

Objective function

It is the same as in the previous approach, as a node $i$ is in the maximal clique if and only if the customer $i$ is coming to the pizzeria in the corresponding optimal solution to the original problem. A bigger clique would induce a better solution, or a better solution would imply the solution customers to generate a bigger clique as all of them are compatible.

$$maximize \ \sum \limits_{i = 1}^c x_i$$

New constraints

The constraints are quite simple now. Two nodes that are not connected can’t be in the same clique. For each pair of nodes not connected $i$ and $j$: $$x_i + x_j \leq 1$$

Formulation with AMPL

We are writing a new model file (very similar to the previous one). In order to reuse data (read function), we will keep the INGR set although it is not going to be used anymore.

The most interesting feature in the model could be the condition to check that two customers are incompatible to generate a constraint. The condition is:

$$Liked_i \cap Disliked_j \neq \emptyset \ \text{ or } \ Liked_j \cap Disliked_i \neq \emptyset$$

A set is not empty if its cardinality is greater or equal to one, so in AMPL we could write:

card(Likes[i] inter Dislikes[j]) >= 1 or card(Likes[j] inter Dislikes[i]) >= 1

%%ampl_eval
reset;
# PARAMETERS AND SETS
param total_customers;

# Set of ingredients
set INGR;
# Customers lists of preferences
set Likes{1..total_customers};
set Dislikes{1..total_customers};

# VARIABLES

# customer comes OR NOT <=> node in the clique or not
var x{i in 1..total_customers}, binary;

# OBJECTIVE FUNCTION
maximize Total_Customers: sum{i in 1..total_customers} x[i];

s.t.
# Using the set operations to check if two nodes are not connected
Compatible{i in 1..total_customers-1, j in i+1..total_customers : card(Likes[i] inter Dislikes[j]) >= 1 or card(Likes[j] inter Dislikes[i]) >= 1}:
	x[i]+x[j] <= 1;

Read the data and solve:

read("c")
%%ampl_eval
option solver cbc;
solve;
display x;
CBC 2.10.5: CBC 2.10.5 optimal, objective -5
0 nodes, 0 iterations, 0.002318 seconds
x [*] :=
 1  0
 2  0
 3  1
 4  1
 5  1
 6  0
 7  1
 8  1
 9  0
10  0
;
%%ampl_eval
set picked_ingr default {};
for{i in 1..total_customers}{
    if x[i] = 1 then let picked_ingr := picked_ingr union Likes[i];
}

printf "%d ", card(picked_ingr) > output_file.out;
for{i in picked_ingr}{
    printf "%s ", i >> output_file.out;
}
shell 'cat output_file.out';
8 akuof luncl vxglq dlust xveqd tfeej xdozp byyii 

Conclusion

First, let’s compare the size of the two models.

  • First approach size: $c+I$ variables + $2c$ constraints.

  • Second approach size: $c$ variables + $c(c-1)/2$ constraints (potentially).

Also in the second approach, each constraint has only two non-zero coefficients along with variables, which is an advantage to have more sparse coefficient matrices.

The choice of one model or another will depend on the concrete instance of the problem, so the sparsity of the matrix and the real number of constraints can change (actually, the constraints of the two models are compatible). AMPL will take care of building the coefficient matrix efficiently, so there is no extra effort to compute the constraints or sums within them once the model is prepared and sent to the solver, and we can focus on thinking algorithmically. Also a lot of constraints and variables would be removed by presolve. To know more about the AMPL modeling language you can take a look to the manual.

Some of the advantages of this approach are:

  • It is really easy to implement solutions.

  • There is no need to debug algorithms, only the correctness of the model.

  • Models are very flexible, so new constraints could be added while the rest of the model remains the same.

Disadvantages:

  • It is hard to estimate how long it is going to take, even in simple models like the ones presented.

  • Sometimes it is hard to formulate the problem, as some of the constraints or the objective function could not adjust to the usual mathematical language. The problem could be non-linear so convergence would be more difficult and even optimal solutions would not be guaranteed.

  • For simple problems, more efficient algorithmic techniques could also give the best solution (Dynamic Programming, optimal greedy approaches…).

Enhancements:

  • Study the problem to come up with presolve heuristics in order to get smaller models.

  • Add termination criterias (solver options) so the solver can stop prematurely when finding a enough good solution (there is a little gap between the best found solution and the known bounds), or even a time limit. If you are lucky the solution could be the optimal one but the optimality was not proved yet.

  • If the solver could not find the optimal solution on time, but we used a termination criteria, we could retrieve a good solution and run some kind of algorithm over it so we can improve and get closer to the optimal (GRASP or Genetic Algorithms, for instance). Actually, when solving a real engineering problem is desirable to combine exact methods as MIP, heuristics (greedy approaches) or metaheuristics (GRASP, Simulated Annealing, …) among others, to reach better solutions.

Author: Marcos Dominguez Velad. Software engineer at AMPL.

marcos@ampl.com