We propose a Markov decision process model for solving the Web service composition (WSC) problem. Iterative policy evaluation,
value iteration, and policy iteration algorithms are used to validate our approach experimentally, on both artificial and real data. The
experimental results show the reliability of the model and of the methods employed, with policy iteration performing best: it requires
the fewest iterations to estimate an optimal policy and yields the highest Quality of Service attributes. Our experiments show that a
WSC problem involving a set of 100,000 individual Web services, in which a valid composition requires the selection of 1,000 services
from the available set, can be solved in the worst case in less than 200 seconds on an Intel Core i5 computer with 6 GB of RAM.
Moreover, a real WSC problem involving only 7 individual Web services is solved in less than 0.08 seconds on the same hardware.
Finally, a comparison with two popular reinforcement learning algorithms, SARSA and Q-learning, shows that they require one to two
orders of magnitude more time than policy iteration, iterative policy evaluation, and value iteration to handle WSC problems of the
same complexity.
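For reference, the sketch below outlines generic policy iteration on a finite, tabular MDP, the procedure whose iteration counts are compared above. It is a minimal illustration under assumed inputs (transition tensor `P`, expected-reward matrix `R`, discount `gamma`), not the composition-specific model described in the paper.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """Generic policy iteration on a finite MDP (illustrative sketch).

    P: transition probabilities, shape (S, A, S)
    R: expected immediate rewards, shape (S, A)
    Returns a deterministic optimal policy (one action per state) and its value function.
    """
    S, A = R.shape
    policy = np.zeros(S, dtype=int)            # start from an arbitrary policy
    while True:
        # Policy evaluation: solve the linear system V = R_pi + gamma * P_pi V exactly
        P_pi = P[np.arange(S), policy]         # (S, S) transitions under the current policy
        R_pi = R[np.arange(S), policy]         # (S,) rewards under the current policy
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to the evaluated values
        Q = R + gamma * (P @ V)                # (S, A) action values
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy): # converged: policy is stable
            return policy, V
        policy = new_policy
```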