def step(self) -> List[Union[RequestOutput, EmbeddingRequestOutput]]:
    """Performs one decoding iteration and returns newly generated results.

    .. figure:: https://i.imgur.com/sv2HssD.png
        :alt: Overview of the step function
        :align: center

        Overview of the step function.
    Details:
        - Step 1: Schedules the sequences to be executed in the next
          iteration and the token blocks to be swapped in/out/copied.

            - Depending on the scheduling policy, sequences may be
              `preempted/reordered`.
            - A Sequence Group (SG) refers to a group of sequences
              generated from the same prompt.

        - Step 2: Calls the distributed executor to execute the model.
        - Step 3: Processes the model output. This mainly includes:

            - Decoding the relevant outputs.
            - Updating the scheduled sequence groups with model outputs
              based on their `sampling parameters` (`use_beam_search` or not).
            - Freeing the finished sequence groups.

        - Finally, it creates and returns the newly generated results.
    """
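    # Minimal usage sketch (illustrative, not part of this file): a driver
    # loop typically calls step() until no requests remain unfinished. The
    # construction helper and `SamplingParams` usage below are assumed from
    # the public engine API; exact signatures may differ across versions.
    #
    #     engine = LLMEngine.from_engine_args(engine_args)
    #     engine.add_request("req-0", "Hello, my name is", SamplingParams())
    #     while engine.has_unfinished_requests():
    #         for request_output in engine.step():
    #             if request_output.finished:
    #                 print(request_output.outputs[0].text)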
# Create the block space manager.
self.block_manager = BlockSpaceManagerImpl(
    block_size=self.cache_config.block_size,
    num_gpu_blocks=num_gpu_blocks,
    num_cpu_blocks=num_cpu_blocks,
    sliding_window=self.cache_config.sliding_window,
    enable_caching=self.cache_config.enable_prefix_caching)
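# Rough sizing sketch (illustrative, assuming the default block_size of 16
# tokens per KV-cache block): a 45-token prompt needs ceil(45 / 16) = 3 GPU
# blocks, and each token generated past a block boundary allocates one more.
#
#     import math
#
#     def blocks_needed(num_tokens: int, block_size: int = 16) -> int:
#         # Number of fixed-size KV-cache blocks that cover `num_tokens`.
#         return math.ceil(num_tokens / block_size)
#
#     assert blocks_needed(45, 16) == 3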
# Sequence groups in the WAITING state.
# Contain new prefill or preempted requests.
self.waiting: Deque[SequenceGroup] = deque()
# Sequence groups in the RUNNING state.
# Contain decode requests.
self.running: Deque[SequenceGroup] = deque()
# Sequence groups in the SWAPPED state.
# Contain decode requests that are swapped out.
self.swapped: Deque[SequenceGroup] = deque()
# Request IDs of sequence groups that have finished since the last step
# iteration. This lets the model know that any state associated with these
# requests can and must be released after the current step. It is used to
# evict the finished requests from the Mamba cache.
self._finished_requests_ids: List[str] = list()
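# Simplified state-transition sketch (toy code with an assumed helper,
# not the real scheduling logic): one scheduling pass admits waiting
# prefills while KV-cache blocks are free, and preempts a running group
# under memory pressure by swapping it out to the CPU.
#
#     def _toy_schedule(self) -> None:
#         # Admit new prefills while the (hypothetical) check reports free blocks.
#         while self.waiting and self._has_free_gpu_blocks(self.waiting[0]):
#             self.running.append(self.waiting.popleft())
#         # Under memory pressure, preempt the most recently admitted group:
#         # its KV blocks move to CPU memory and the group enters SWAPPED.
#         if self.running and not self._has_free_gpu_blocks(self.running[-1]):
#             self.swapped.append(self.running.pop())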