Simon Mo (simon-mo)

{
"question_id": "58210e39b3fd4441a2bd4a518bb44c2d",
"prompt": "What is the difference between OpenCL and CUDA?",
"openai_scores_raw_choices_nested": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "{\n \"topic_modeling\": \"Technical Comparison\",\n \"score_reason\": \"This prompt requires the AI to accurately compare and contrast two distinct technologies, OpenCL and CUDA. It assesses the AI's factual accuracy and knowledge of these technologies, as well as its ability to articulate the differences between them.\",\n \"score_value\": 9\n}",
layout: post
title: vLLM Now Supports AMD GPUs
author: vLLM team and EmbeddedLLM

TL;DR:

  • With help from the EmbeddedLLM team, vLLM can now run on top of ROCm-enabled AMD GPUs.
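Usage on AMD hardware looks the same as elsewhere, since vLLM's Python entry point is hardware-agnostic. Below is a minimal sketch assuming a ROCm build of vLLM; the model name is purely illustrative.

```python
# Minimal sketch: the usual vLLM quick-start, which should work unchanged on a
# ROCm-enabled AMD GPU once vLLM is built with ROCm support.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model name
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```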
simon-mo / osd.md

abstract: |
  Speculative decoding is a pivotal technique to accelerate the inference of large language models (LLMs) by employing a smaller draft model to predict the target model’s outputs. However, its efficacy can be limited due to the low predictive accuracy of the draft model, particularly when faced with diverse text inputs and a significant capability gap between the draft and target models. We introduce online speculative decoding to address this challenge. The main idea is to continually update (multiple) draft model(s) on observed user
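For intuition, here is a minimal sketch of the basic speculative decoding loop that online speculative decoding builds on. It uses simple greedy verification rather than the full rejection-sampling acceptance rule, and `draft_next` / `target_next` are hypothetical stand-ins for the small and large models' next-token functions, not the paper's code.

```python
# A minimal sketch of the speculative decoding loop with greedy verification:
# a draft token is accepted only when it matches the target model's greedy
# choice. `draft_next` and `target_next` are hypothetical stand-ins.
from typing import Callable, List


def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model
    target_next: Callable[[List[int]], int],  # expensive target model
    num_tokens: int,
    gamma: int = 4,  # tokens proposed by the draft model per step
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft model proposes gamma tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(gamma):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposal; in a real system this is a
        #    single batched forward pass over all gamma positions.
        for i, t in enumerate(proposal):
            expected = target_next(tokens + proposal[:i])
            if t != expected:
                # First mismatch: keep the verified prefix plus the target's
                # own token, then go back to drafting.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every proposed token was accepted.
            tokens.extend(proposal)
    return tokens[: len(prompt) + num_tokens]
```

Online speculative decoding keeps this loop but continually updates the draft model(s) on observed user queries, improving the draft model's accuracy, and hence the acceptance rate, over time.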

<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
xmlns:xlink="http://www.w3.org/1999/xlink">
<teiHeader xml:lang="en">
<fileDesc>
<titleStmt>
<title level="a" type="main">Ambry</title>
</titleStmt>

WANalytics: Analytics for a Geo-Distributed Data-Intensive World

WANalytics is proposed: a system that pushes computation to edge data centers, automatically optimizes workflow execution plans, and replicates data when needed; it delivers substantial gains on three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.

Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is not supported by currently deployed systems nor sufficiently studied in the literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur subs

When will you run into this issue:

This issue occurs when you run a Python function on a Ray cluster with @ray.remote and the function runs on the head node instead of a worker node.

Here's how to fix it:

Functions are scheduled on any node that has available CPUs, so it is normal for a task to land on the head node. If you'd like to keep functions off the head node, set its --num-cpus to 0 when starting Ray: ray start --head --num-cpus=0. Alternatively, you can use the node affinity scheduling strategy to avoid scheduling on the head node.

Explanation:

By default, Ray schedules workloads on any node with available CPUs. The head node typically advertises CPU resources, so it may run the function. To avoid this, set the head node's --num-cpus to 0 when starting Ray, which prevents CPU workloads from being scheduled there. Alternatively, you can use a node affinity scheduling strategy to pin tasks to specific worker nodes, as sketched below.
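As a concrete sketch (assuming Ray 2.x and a head node started with ray start --head --num-cpus=0), the node affinity approach looks roughly like this; the node-selection logic is illustrative, not the only way to pick a worker:

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")  # connect to the existing cluster

# Pick any alive node that still advertises CPU resources. If the head was
# started with --num-cpus=0 it advertises no CPUs, so this picks a worker.
worker_node_id = next(
    n["NodeID"]
    for n in ray.nodes()
    if n["Alive"] and n["Resources"].get("CPU", 0) > 0
)

@ray.remote(
    scheduling_strategy=NodeAffinitySchedulingStrategy(
        node_id=worker_node_id,
        soft=False,  # hard constraint: do not fall back to other nodes
    )
)
def work():
    return "ran on a worker node"

print(ray.get(work.remote()))
```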

Simple in-memory job queue in Ray Serve

serve run queue_proxy:app
python client.py

(base) ➜  tmp python client.py
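Below is a minimal sketch of what an in-memory job queue behind a Ray Serve deployment could look like; the queue_proxy.py module name matches the serve run command above, but the class and field names are illustrative, not the gist's actual code.

```python
# queue_proxy.py -- a minimal sketch of an in-memory job queue as a Ray Serve
# deployment. Jobs live only in this replica's memory, so they are lost if the
# replica restarts; names below are illustrative.
import asyncio
from ray import serve
from starlette.requests import Request


@serve.deployment
class QueueProxy:
    def __init__(self):
        self.queue: asyncio.Queue = asyncio.Queue()
        self._next_id = 0

    async def __call__(self, request: Request) -> dict:
        if request.method == "POST":
            payload = await request.json()
            self._next_id += 1
            await self.queue.put((self._next_id, payload))
            return {"job_id": self._next_id}
        # Any other method: report current queue depth.
        return {"pending": self.queue.qsize()}


# `serve run queue_proxy:app` looks up this bound application object.
app = QueueProxy.bind()
```

A matching client.py would then simply POST jobs to http://127.0.0.1:8000 (for example with requests) and print the returned job_id.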