Large language models are learning to code, but can they reason well enough to produce novel algorithms that can solve combinatorial optimization problems?
Many problems throughout science and technology are structured such that potential solutions to these problems are extremely efficient to evaluate, but for which finding a particularly good solution is extremely complex. These problems are so common that they show up in everyday life too.
Let’s say you’ve got a particularly long list of items you need to pick up from some local retailers this weekend. There’s a list of groceries, sure, but you also need to stop at the pharmacy to pick up a couple of prescription and non-prescription items. You need some new socks from the department store, and there are some tools from the hardware store you’ve been meaning to add to your toolbox. Luckily you have a car that you can use for transportation. But of course, it’s running low on gas. To make matters worse, you don’t get paid until next week, so you’d really like to minimize the overall cost of these errands .
How much does an outing like this cost? This is pretty easy to figure out: just keep your receipts and total up your costs when you get back home. But how should you go about completing these errands if you want to minimize this aggregate cost? This is much more difficult to determine.
You’d probably want to minimize your overall trip length to save on fuel expenses, so you’ll need to find the shortest route between your home, the stores, and back to your home. But the items on your list also cost money, and some items are sold at multiple stores, varying in price between these stores. So the shortest route between all of the stores may not minimize the overall costs if the cost of driving to the clothing store outweighs the cost of adding a more expensive pair of socks to your shopping cart while you’re shopping for groceries. In other words, there are a number of planning heuristics you could apply (minimize transportation distance, minimize per-item cost, minimize number of stores, choose a random store and start there, etc.) to potentially save money. This choice of planning heuristic provides a family of routes you could feasibly follow and can be used to determine a single cost-effective route.
However, these planning heuristics can vary substantially in the quality of the efficient routes they provide, and choosing a particular heuristic may preclude you from taking the absolute minimum-cost trip. For example, if the socks at the grocery store cost less than the additional cost of going to the department store, a heuristic which only minimizes the overall transportation distance of the trip will always be sub-optimal, as it assumes the department store is always a necessary waypoint on the itinerary.
Even problems as simple as running errands can quickly become logistically unwieldy under constraints. Now imagine the items on your list and the number of stores numbered in the tens or hundreds of thousands. At this scale, the discovery of a more efficient planning heuristic could save millions of dollars. These types of combinatorial optimization problems, problems with large solution spaces whose characterizations do not vary continuously with evaluated performance, underlie many sub-disciplines within the logistics and industrial engineering fields. Our shopping example problem is ubiquitous enough to have been given its own name in the literature: the traveling purchaser problem. This problem is a generalization of the better-known traveling salesman problem which involves finding the shortest route within a network which visits each node once, returning to the starting node at the end.
Combinatorial optimization appears in a variety of contexts across the sciences. In biology, genetic sequence assembly and phylogenetic tree construction both involve a search for an optimal ordering of objects from a discrete but overlapping set of unordered objects. Heuristics like Eulerian path-finding in De Brujin graphs and distance-based neighborhood joining are, respectively, employed to find approximate solutions to these problems. Determining the proper sequence of reagents, reaction temperatures, and intermediaries to create a target chemical compound is a combinatorial optimization problem which whose search space is extremely slow—and potentially dangerous—to explore. Astronomers optimize for celestial body observation by choosing from a constrained menu of telescopic equipment, each of which may be further constrained by unpredictable weather and target visibility patterns. Difficult optimization problems like this are so ubiquitous that many computationally-driven fields are completely determined by their orbit around an inherently intractable combinatorial optimization problem, with advances in these fields coming in the form of more effective heuristics for solving said problem.
Given the ubiquity and importance of heuristic solutions to combinatorial optimization problems, a method for autonomously generating more effective algorithmic solutions would have massive value across the sciences. This potential has been understood since at least the 1960s, with the discovery that combining dispatching rules in production scheduling could outperform independent application of these rules (Fisher & Thompson, 1963). As processing power grew throughout the 20th century, so too did the variety and efficiency of algorithms for heuristic discovery (Burke et al., 2013).
At the turn of the century, the term hyper-heuristics was coined to describe the emergence of tools which, buoyed by the availability of more efficient machine learning algorithms, learn to automatically create and apply heuristics to solve combinatorial optimization problems (Burke et al., 2013). Hyper-heuristic methods generally decompose into two intersecting classes of approaches: heuristic selection and generation.
Heuristic selection methods assume access to a collection of heuristics for solving an optimization problem, and the objective becomes one of learning the most effective strategy for applying the member heuristics for a particular realization of the problem. In our shopping example, we noted that there were a number of heuristics (transport cost minimization, per-store price optimization, random search, etc.) which could be applied to the problem. A hyper-heuristic search algorithm would seek to apply some combination of these heuristics to plan an outing given information like the number of stores or the variance of item prices.
Heuristic generation methods, by contrast, seek to produce new heuristics for an optimization problem given an online or offline collection of feedback signals which can be used to change the heuristics themselves. Because most combinatorial optimization problems are large and complex, each proposed solution must be computationally assessable, so heuristic generation procedures must return heuristics that can be phrased algorithmically as well. Historically, this typically meant hyper-heuristic generation methods required a domain expert to seed the collection of functions whose hyperparameters or execution order may be augmented and composed. Alternatively, a popular generation method, known as genetic programming, begins with an algorithmic solution and iteratively alters the program by making systematic mutations to its syntax tree or other high-level variables, retaining only the mutated programs which improve task performance. While genetic programming has the capacity to generate novel and effective programs, choosing the programmatic elements which may be altered throughout the evolutionary computation process requires human insight. Allowing these elements to be too granular (like each leaf in the syntax tree) could result in an extremely large search space, and evaluating each program would take significant time. If these programmatic elements to be mutated are too coarse, the hyper-heuristic method will only be able to explore the performance in a small neighborhood of the total solution space.
The necessity that heuristics be executable is a constraint which has shaped the hyper-heuristics field from its inception. While the heuristics resulting from prior synthetic program generation methods may be quite complex, their reliance on a pre-specified set of computational primitives typically limits the space of explorable heuristics to a small subset of the overall solution space. To enlarge the space of possible heuristics, one must provide more functions for the hyper-heuristic algorithm to mix and match, a substantial up-front labor cost.
But what if we could automate some of these algorithm development costs? With the emergence of LLM-based models for code generation in the past years, might there be a more efficient way to explore the space of programs which form heuristics for combinatorial optimization? As we will see, LLMs can provide this automation, but getting them to produce novel heuristics—instead of the most likely one—requires quite a bit of effort.
Given the recent success of large language models (LLMs) in generating code which can, with modest reliability, solve prompted computational tasks (Nijkamp et al., 2022; Fried et al., 2022; Chen et al., 2021), a number of recent papers have begun investigating the efficacy of employing LLMs to provide executable heuristic solutions to combinatorial optimization problems. Although this area of research is relatively new, it is moving quickly, so this section aims to provide a description of the general contours of LLM-based hyper-heuristic methods without too much concern for the specific details of each approach in the literature.
That said, although the LLM-based hyper-heuristic methods proposed in the literature have been applied to a diversity of optimization problems, including bin packing (Romera-Paredes et al., 2024), bitrate selection (He et al., 2024), the cap set problem (Romera-Paredes et al., 2024), data compression (Zeng, Zhou, Li, Sun, & Zhao, 2024), and traveling salesman problems (Ye, Wang, Cao, & Song, 2024; Liu et al., 2024), the hyper-heuristic methods by which each of these state-of-the-art algorithms are generated all follow the same evolutionary design principles. These principles are outlined below.
Let’s say you’re given an optimization problem similar to the traveling salesman problem (TSP) described in the introduction. That is, we have a list of locations which need to be visited before returning to the starting location, and we know that we’ll encounter some items and their prices throughout the trip. One way to solve this with an LLM is to just ask it for an algorithmic solution, giving it a description of the problem, the list of locations and their distances, and a description of the desired output as an optimized route and purchase list.
Prompting the LLM for a solution in this way could work—the LLM might output a novel TSP algorithm which outperforms the vast collection of TSP algorithms humans have already designed—but such an outcome is unlikely. This is because state-of-the-art language generation models are primarily trained to predict the most likely distribution of masked tokens given a collection of context tokens from the training corpus. For autoregressive models, which so far appear to achieve the best performance in code generation (Chen et al., 2021), this amounts to predicting the most likely distribution of proceeding tokens given a context window of prior tokens which, in this example, is the prompt.
We’re generally uninterested in the most likely distribution of tokens which follow our TSP prompt. This distribution of tokens is probably composed of the most popular TSP algorithms in the LLM’s training corpus, so it’s unlikely the generated algorithmic solution will be all that different than what one could find by searching online. If we wanted one of these common solutions, we would have been better off just finding and using the best TSP heuristic algorithm reported within the existing optimization literature.
Scientifically, what we’d actually like the LLM to generate a TSP heuristic that is differentiated from and, ideally, performs better than the other (hyper-)heuristic algorithms that already exist. This is much taller ask, for the reasons just described: the modus operandi of LLMs, and of generative modeling more generally, is to produce the most likely output conditioned on the input. To use an LLM as a heuristic generator, then, requires finding a method which can circumvent the LLM’s proclivity for producing commonplace algorithmic solutions while ensuring the solutions still compile and solve the optimization problem at hand.
Nearly all LLM-based hyper-heuristic methods in the literature use genetic programming to force the LLM to more courageously explore the space of heuristic algorithms. This exploration takes the form of prompt-based genetic programming, where the LLM is provided example algorithms and asked to come up with a new algorithm that achieves the same result as the examples. Although the specific details of each approach to LLM-based hyper-heurstics differ somewhat, the general evolutionary framework can be summarized as a six-step process.
Encoding
The first step is to figure out how to describe to the LLM, through a prompt, the optimization task. This likely takes the form of a textual description of the optimization problem and perhaps some expert domain knowledge which might help the LLM traverse the space of heuristics. We want the LLM to generate a heuristic in the form of an algorithm, and you’ll eventually need to run that algorithm in order to evaluate its efficacy, so you’ll probably want to ask the LLM to generate this algorithm with a particular function signature using a fixed collection of variable names so you can plug it into your evaluation suite directly (Liu, Tong, Yuan, & Zhang, 2023; Romera-Paredes et al., 2024). Otherwise, you might have to write some more complicated code to determine where amongst the LLM’s loquacious output lies the proposed algorithm (van Stein & Bäck, 2024). We’re going to ask the LLM to develop a new algorithm based on existing ones, so we must provide an initial collection of parent algorithms for it to alter. You can either seed these parent examples by including a couple of examples of combinatorial optimization solutions from the literature (Ye, Wang, Cao, & Song, 2024; Zeng, Zhou, Li, Sun, & Zhao, 2024) or by simply including the skeleton structure of the desired algorithm and asking the LLM to fill in the details (Romera-Paredes et al., 2024).
Initialization
Once you’ve chosen the encoding you’ll use to prompt the LLM into providing heuristic solutions, you can in principle create as many new solutions as you’d like (or can afford) by feeding this prompt into the LLM and extracting its output. But as previously discussed, these solutions will probably be rather familiar, so repeatedly passing the same prompt to the LLM will likely result in a collection of banal, repetitive solutions. One could adjust the LLM’s temperature during this generation process to introduce more variation in the prompts. However, this uncertainty must be balanced against the ever-present constraint that these output heuristics must be executable and free of bugs.
To force more consistent diversification, we’ll generate a seed collection of solutions which we will use to initialize our algorithmic population, then use an evolutionary principles to change the phenotypic composition of this population. The proceeding steps will mutate these seed algorithms such that they become increasingly differentiated and, ideally, performant with each generation.
We start this evolutionary process by creating N algorithms, evaluating them on the optimization task, and recording their performance.
Selection
From our collection of evaluated heuristic algorithms (there are N after the first initialization step, but this number could grow with each iteration of the evolutionary procedure), we want some way to choose the “parent” algorithms to be recombined and improved by the LLM. We could simply pick the solutions with the highest evaluation performance (Stein & Bäck, 2024), but doing this will likely result in the population of heuristics which quickly converges to a collection of similar solutions, as the best-performing algorithms in early iterations are likely to share similar characteristics. As with biological evolution, population diversity is necessary to support a broad exploration of the solution space. Our goal in the Selection step is to choose M heuristic algorithms which can be used by the LLM to generate a new generation of algorithms which diverge from their parents.
There are many ways to introduce diversity into this algorithmic evolutionary procedure. One could simply perform a random sampling over the existing heuristics (Liu, Tong, Yuan, & Zhang, 2023; Ye, Wang, Cao, & Song, 2024), or sample according to some performance-weighted (Zeng, Zhou, Li, Sun, & Zhao, 2024) or simplicity-weighted (Romera-Paredes et al., 2024) distribution.
However, sampling from the entire population each time may prematurely extinguish interesting solution families which can take numerous generations to develop. To accommodate these types of phenotypic specializations, we might instead force isolated clusters of heuristic solutions to develop on their own, creating more distinctive lineages through isolation. We can create these clusters either by asking the LLM to assign the existing algorithms into clusters based on their solution techniques (Zeng, Zhou, Li, Sun, & Zhao, 2024), or we can create synthetic clusters by simply assigning the initial N algorithms to C clusters and performing random selection cluster-wise instead (Romera-Paredes et al., 2024).
Crossover
After choosing M heuristic algorithms as parents according to some Selection procedure, the Crossover step seeks to create the next generation of heuristics from these parents. There are also multiple ways one could perform Crossover.
A popular approach is to concatenate these algorithms together along with quantitative information on their performance and prompt the LLM to generate a collection of new algorithms from these examples (Zeng, Zhou, Li, Sun, & Zhao, 2024; Liu, Tong, Yuan, & Zhang, 2023; van Stein & Bäck, 2024). Though other methods opt for more traditional two-parent crossover coupling (Ye, Wang, Cao, & Song, 2024; Romera-Paredes et al., 2024). It is presently unclear which approach works better, but we can say that concatenating each parent algorithm will increase the number of context tokens required by the prompt compared to the two-parent approach.
In either case, by including information about heuristic performance along with the full solution or skeleton code, we can leverage the ability of LLMs to perform in-context learning (Brown et al., 2020; Dong et al., 2022) to generate new samples which outperform their parents. Ye, Wang, Cao, & Song (2024) takes this learning a step further by requesting textual feedback from the LLM regarding the performance of each parent algorithm. This feedback is then included into the Crossover prompt along with the pair of parent algorithms and their performance metrics.
The result of the Crossover step is a collection of X new heuristic algorithms.
Mutation
While Crossover provides an evolutionary route for exploring the combinatorial breadth of the solution space, pure Crossover-based evolution strategies lack the capacity to investigate the regions around the heuristics that are particularly well-adapted for solving the optimization task. In other words, in addition to the evolutionary exploration provided by Crossover, we would also like to impose an evolutionary method for exploitation which allows the genetic process to further refine high-performing heuristics without becoming diluted by lower-performing solutions during Crossover. The Mutation step facilitates this refinement.
Zeng, Zhou, Li, Sun, & Zhao (2024) and Liu, Tong, Yuan, & Zhang (2023) perform mutation for each of the X samples produced during the Crossover step by prompting the LLM to suggest improvements to each child algorithm. By contrast Ye, Wang, Cao, & Song (2024) perform elite mutation, choosing a subset of the best-performing heuristics and prompting the LLM to improve only their performance. Although Romera-Paredes et al. (2024) don’t implement a mutation step directly, the method’s island-based evolutionary strategy has a similar exploitation-selection effect, as high-performing islands develop independently by performing within-island Crossover.
The mutation step results in E new solutions.
Population Management
Recall that we started with N heuristic algorithms at Initialization. But over the course of the evolutionary process, we’ve created X new algorithms and mutated E algorithms, leaving a total of N + X + E potential heuristics after each generation. Because the complexity and performance of each evolutionary step is dependent on the number of starting algorithms (N), we will likely need to manage the overall size of the algorithm population throughout this evolutionary procedure. The most common way to do this is to simply remove the D worst-performing solutions from the pool of heuristics. To maintain a consistent population size, one would set D = X + E.
After the (optional) Population Management step, these evolutionary algorithms return to the selection step, and this process is repeated until some convergence criteria are reached. At the end of this process, one is left with some collection of heuristic algorithms. These evolved algorithms can then be evaluated on a test set of problems to determine the generalized efficacy of these LLM-based heuristics.
Exploiting the abilities of LLMs to understand natural language instructions and generate executable code to search for novel optimization heuristics is a clever use of these emerging technologies. Indeed, this general paradigm of performing optimization over the algorithmic solution space of a problem, as opposed to optimizing a specific algorithm’s parameters, is quite exciting. Previously, without a reliable algorithmic solution generator, these high-level approaches required meticulous component design by domain experts to suitably span the solution space. By combining their capacity for code generation with an evolutionary algorithm, LLMs appear to be capable substitutes for this labor-intensive component design, or at the very least, can be used to accelerate this design process.
Despite these successes, LLM-based hyper-heuristic optimization still has a number of hurdles to overcome before all optimization can be completely and reliably automated at an algorithmic level of abstraction.
Apart from the experiments presented in Ye, Wang, Cao, & Song (2024), the LLM-based hyper-heuristic works covered in this article lack experimental comparisons between various hyper-heuristic solutions. This lack of benchmarking in the literature is currently excusable due to the relative nascency of the paradigm as a whole. However, if LLM-based hyper-heuristics is to maintain momentum as a field, future work must focus on benchmarking the current suite of LLM-based methods. Optimization researchers have already provided a number of well-known benchmark datasets which could be used for these purposes (Reinelt, 1991; Hansen, Finck, Ros, & Auger, 2009). I anticipate the more difficult problem in benchmarking will come from the models themselves. Because the versioning and APIs of the state-of-the-art LLMs continue to change quickly, ensuring each method’s implementation is replicable will require ongoing software development effort.
Although more powerful, LLM-based hyper-heuristic optimization is slow and expensive compared to more traditional methods. The LLMs that are best-suited for the instruction-following, code interpretation, and code generation are typically the highest-parameter models. These high-parameter models are expensive to run and take a long time to generate output. While both the parameter efficiency and generation speed of these models are improving, their usage as meta-optimizers currently comes with significant cost, and these costs may not be affordable to all researchers in the field. This lack of may decrease transparency and curtail adoption of these methods outside of the hyper-heuristics field.
In addition to their computational costs, the fallibility of the LLM backbones to these methods further decrease efficiency. All of the LLM-based hyper-heuristic methods described in this article must perform some type of error handling in order to excise and ignore buggy heuristics proposed by the LLM. Generating these erroneous functions and validating their performance costs time and money, so the inaccuracy and hallucination problems endemic to modern LLMs has a negative impact on the efficiency of LLM-based hyper-heuristics as well. Worse, many of the hyper-heuristic methods described in this article are reliant on the LLM’s ability to evaluate the quality of the proposed algorithmic solutions in context. Catching reasoning errors in these components is a much more difficult task, and the impact of these errors on the larger hyper-heuristic algorithm are more insidious.
Finally, there is a fundamental mismatch between the objectives of a generative language models and those of hyper-heuristics, and this tension makes LLM-based hyper-heuristic generation difficult. LLMs by default seek to generate the most likely output distribution of words, but hyper-heuristic generators seek novel algorithmic solutions which are unlikely to be constructible using the collection of most-likely tokens. The entire evolutionary scaffolding of LLM-based hyper-heuristics exists to extract and refine more creative approaches from the LLM. Viewed in this light, LLMs, or at least the objectives which are used to train them, are fundamentally mismatched to the goals of hyper-heuristic optimization. One could of course argue that these training objectives are necessary in order to produce an LLM that can even write executable code in the first place, and that changing these training objectives would result in even worse coding and reasoning capabilities from the LLM. However, there may be fertile ground between these two competing objectives to be explored, and hyper-heuristics research is well-situated to explore what may lie in this middle ground between predictability and novelty.