The transcript of the Dask tutorial presented on 2020-06-19

01:36:09 Matthew Rocklin Afternoon!
01:36:18 garanews Hello there
01:36:23 Arnab Biswas Hello!
01:40:28 robbie Hi all, slightly weird question but is anyone talking right now? Just checking to see if my sound is working
01:41:15 Jonas Bostelmann Hi, nobody is talking at the moment! :-)
01:41:52 robbie Great, thanks!
01:44:31 chelle gentemann i feel like hold music should be playing.
01:45:04 Jacob Tomlinson Oooh good idea for next time!
01:45:15 chelle gentemann someone could sing...
01:46:18 Matthew Rocklin +1
01:46:35 Matthew Rocklin I would also accept witty banter
01:48:29 Jacob Tomlinson bit.ly/2Yehz4Y
01:52:50 Matthew Rocklin Hi Everyone! I'm here to help. If you have a question about the tutorial and are comfortable, please feel free to use the public chat (if you have a question, it's likely that others in the group have the same question). If you prefer to ask privately feel free to message only me. There is a "To: Everyone" drop down menu in the chat menu that you can use to select people to chat with privately.
01:53:54 Naresh Does Dask support Spark clusters?
01:55:36 Matthew Rocklin If you want to launch this binder, the direct link is here: https://mybinder.org/v2/gh/jacobtomlinson/dask-video-tutorial-2020/master?urlpath=lab
01:56:38 Arnab Biswas Resolution is still a problem for me. :-(
01:56:54 Arnab Biswas Yes
01:57:13 Petar Todorov screen is still blurry to me sorry
01:57:14 Matthew Rocklin Great question Naresh! You can launch Dask on any cluster designed for Spark/MapReduce using the Dask-Yarn project. Docs are available at yarn.dask.org
01:57:52 Matthew Rocklin For more information on Dask and Spark you may want to read through https://docs.dask.org/en/latest/spark.html
01:58:32 Naresh Thanks. I will check it out.
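A minimal sketch of what Matthew describes, assuming a packed conda environment (environment.tar.gz) and hypothetical worker sizing:

```python
# Launch Dask on a YARN/Hadoop cluster with the dask-yarn project
# (docs at yarn.dask.org). Paths and sizes here are hypothetical.
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="environment.tar.gz",  # packed conda env shipped to workers
    worker_vcores=2,
    worker_memory="4GiB",
)
cluster.scale(4)          # ask YARN for 4 worker containers
client = Client(cluster)  # point Dask at the YARN cluster
```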
01:59:17 Petar Todorov Does Dask support pandas' query?
02:00:10 Luis Gonzalez When the CSV file or files you are trying to analyze are way bigger than your available memory, what development pattern or Dask configuration do I need to take care of in order to be able to process the data?
02:00:11 Matthew Rocklin Hi Petar! Yes, you can see a full list of Dask Dataframe's API here: https://docs.dask.org/en/latest/dataframe-api.html
02:00:19 Petar Todorov thanks
02:01:10 akshay Can I lazy load multiple csvs and then output a single csv with a subset of columns?
02:01:16 J Emmanuel Johnson So once you've called the .compute() but you haven't assigned it to a variable, is it now taking some of your memory?
02:01:46 J Emmanuel Johnson perfect, thanks!
02:01:53 Matthew Rocklin Good question Luis and Akshay. Yes, Dask dataframe is often used to read many large CSV files. dask.dataframe.read_csv supports most of the same keyword arguments as pandas.read_csv. https://docs.dask.org/en/latest/dataframe-create.html
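A minimal sketch of Akshay's use case, with hypothetical file paths and column names: lazily read many CSVs, keep a subset of columns, and write the result back out.

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/2020-*.csv")  # lazy: one pandas partition per file
subset = ddf[["timestamp", "value"]]  # still lazy: just selects columns
subset.to_csv("out/part-*.csv")       # eager: runs the computation and
                                      # writes one CSV per partition
# (pass a plain filename plus single_file=True to get one output CSV instead)
```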
02:02:10 davidrpugh How does the def detect the client?
02:02:21 akshay Awesome, thanks Matthew :)
02:02:47 davidrpugh Sorry, how does the ddf detect the underlying client? What happens if you don’t manually specify a client?
02:02:57 Matthew Rocklin Tricky question David :) When you set up a Client it modifies the global Dask scheduler that is used by default
02:03:38 Matthew Rocklin If you don't create a client then Dask uses a local threadpool
02:04:18 Matthew Rocklin There is more information about how Dask handles different schedulers like the dask.distributed.Client and a local thread pool here: https://docs.dask.org/en/latest/scheduling.html
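A small sketch of that scheduler behaviour, with hypothetical data files: without a Client, .compute() runs on the local thread pool; once a Client exists, it becomes the global default.

```python
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_csv("data/*.csv")

ddf.value.mean().compute()  # no Client yet: runs on the local thread pool

client = Client()           # starts a local distributed cluster and registers
                            # itself as the default scheduler
ddf.value.mean().compute()  # now runs on the distributed cluster
```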
02:04:36 Joel Branch Just curious - what’s the largest known Dask cluster being used in production? Even just an estimate would be cool
02:05:55 akshay Absolutely love RAPIDS
02:06:00 Lorea what do I need to install to see the Performance diagnostic boards?
02:06:29 Matthew Rocklin Hi Joel! In practice most problems can be resolved with 10-100 machines. That being said, we do see people using 1000 machines in some cases. Dask appears to break down around the 10000 machine level. There is a good issue on this showing Dask failing to use Summit, the world's fastest supercomputer: dask/distributed#3691
02:06:55 Joel Branch Thanks sir!
02:07:13 Matthew Rocklin Hi Lorea! For this tutorial we're using Binder, which should handle the setup for you. https://mybinder.org/v2/gh/jacobtomlinson/dask-video-tutorial-2020/master?urlpath=lab
02:07:46 Matthew Rocklin But if you're asking about local use, Dask will run the performance dashboard by default if you have installed the bokeh package.
02:07:56 Matthew Rocklin This is hosted at http://localhost:8787/status by default
02:08:15 Lorea Thank you Matthew, I'm using my own environment on the cloud (cloned the repo you sent) and would like to build the boards there
02:08:37 Matthew Rocklin If you want the dashboard to be integrated into JupyterLab then you should take a look at the dask-labextension package: https://github.com/dask/dask-labextension
02:08:45 Matthew Rocklin There is a video there that should help you set things up
02:08:53 J Emmanuel Johnson A side comment: most use cases I've seen are in JupyterLab. Can you link some examples where Dask is inside an ETL framework using scripts? Including spawning and closing the distributed clients as needed.
02:08:55 Lorea great ! thanx
02:10:21 Matthew Rocklin Good point J Emmanuel. There isn't much difference between using Dask in a notebook and running Dask in a script. We tend to use notebooks just because they show outputs nicely.
02:10:33 Matthew Rocklin Is there a part of working in scripts that you're curious about in particular?
02:12:16 Matthew Rocklin I'm curious, how many people have managed to launch Dask with the Client() constructor and can see the dashboards light up on the side?
02:12:31 Matthew Rocklin Or more importantly, how many people can't get this to work?
02:12:35 J Emmanuel Johnson I just want some best practices from experienced programmers for opening and closing clients as needed. Too many times I've caught myself with open clients or too many spawned processes, etc.
02:13:15 Matthew Rocklin There is a little "More ..." option in the chat. Maybe a Thumbs Up if you're doing well and a Thumbs Down if you're still having trouble?
02:14:54 davidrpugh Getting an error trying to open the file “ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2”
02:15:31 roberto you should use sep=" "
02:15:41 Karan Sorry, where can I find the go faster button?
02:16:06 roberto I can't find the faster button
02:16:17 gennadiy.donchyts@gmail.com is there a faster button? :)
02:16:50 gennadiy.donchyts@gmail.com yes
02:16:52 gennadiy.donchyts@gmail.com thx
02:16:58 Robbie Found it!
02:17:01 Luis Gonzalez thanks!
02:18:34 Matthew Rocklin That was fun everyone. Thanks!
02:24:42 Matthew Rocklin Ooh, fancy
02:24:45 Luis Gonzalez Side question about the JupyterLab extension: in case I already have another cluster running, can I connect to the dashboard using the extension? Or does it only work for clusters created with the extension?
02:25:16 akshay How do I move from jupyter notebook format to jupyterlab?
02:26:38 Matthew Rocklin Luis: You can connect the JupyterLab extension to any running Dask cluster by placing the address of the dashboard into the top bar of the Dask sidebar. You can also press the Magnifying Glass icon to have Dask try to search the local notebook for your dashboard. Specifying the dashboard address manually is more robust though.
02:26:57 Konstantin I think substitute 'tree' with 'lab' in the URL
02:26:58 Jonas Bostelmann Change "tree" to "lab" in the URL
02:27:12 akshay Thank you :)
02:27:26 Luis Gonzalez thanks Matthew will try it
02:27:34 Matthew Rocklin Akshay: on disk they're both the same kind of file, so if you have notebooks today you can use them with JupyterLab without reformatting them
02:28:26 Matthew Rocklin Just run jupyter lab rather than jupyter notebook
02:28:28 Luis Gonzalez Will a recording of this session be available?
02:29:11 Matthew Rocklin That's the intention :)
02:29:16 Deepan Manoharan What is the largest Dask deployment in production as of today? Do we have references/case studies?
02:29:29 Matthew Rocklin We're recording now, but this is our first time recording with Zoom. Hopefully everything works out well :)
02:29:48 Luis Gonzalez great!
02:30:18 Matthew Rocklin Hi Deepan! In practice most problems can be resolved with 10-100 machines. That being said, we do see people using 1000 machines in some cases. Dask appears to break down around the 10000 machine level. There is a good issue on this showing Dask failing to use Summit, the world's fastest supercomputer: dask/distributed#3691
02:31:09 Deepan Manoharan Thanks.
02:32:02 Matthew Rocklin Following along with Jacob's talk, here is the Dask Array API page: https://docs.dask.org/en/latest/array-api.html
02:32:57 chelle gentemann How should we use .visualize()? What are we looking for?
02:33:31 Matthew Rocklin At this point I think that we're using .visualize() mostly to build intuition around how Dask breaks down commands into task graphs
02:33:54 roberto Hi Matthew, what about distributing user-defined functions within the cluster? Is it enough to define them inside the notebook, or should we somehow pack them inside a Docker image?
02:34:20 Matthew Rocklin But in practice visualize can also help you understand a bit about your computation. Is it embarrassingly parallel? Are there bottlenecks? And so on. More on understanding performance is here: https://docs.dask.org/en/latest/understanding-performance.html
02:34:29 Jonathan MacCarthy @chelle, I’ve used it to see where I might be able to break up a large graph into smaller batches, if I’m ever having trouble computing a naively large graph.
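A tiny example of using .visualize() to build that intuition (rendering the graph requires the graphviz library):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))  # 16 chunks
y = (x + x.T).sum(axis=0)
y.visualize(filename="graph.png")  # draws the task graph; computes nothing
```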
02:34:39 roberto I'm thinking in particular about this https://github.com/rsignell-usgs/sagemaker-fargate-test/blob/master/README.md
02:35:24 Matthew Rocklin Great question Roberto: for small functions defined in a notebook Dask will handle it. If you have a local script that you want to move around, see the Client.upload_file method. For larger software packages though, yes, you should distribute those packages as part of the software environment that you ship to workers, for example with Docker
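A minimal sketch of the middle case, with a hypothetical helpers.py module and a hypothetical clean function inside it:

```python
from dask.distributed import Client
import helpers  # our local module; hypothetical

client = Client("tcp://scheduler-address:8786")  # hypothetical address
client.upload_file("helpers.py")  # ship the module to every worker

futures = client.map(helpers.clean, ["a.csv", "b.csv"])  # runs on workers
results = client.gather(futures)
```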
02:37:09 Matthew Rocklin Dask-ML docs are at https://ml.dask.org
02:38:12 roberto thank you!
02:38:43 Joel Branch Just fyi - the “dark + tensor flow” link is broken
02:38:53 Joel Branch “Dask” + tensor flow :-)
02:38:53 Deepan Manoharan Hi Matthew, if the data doesn't fit in the memory of the machines in the cluster, how does Dask handle the data in such a situation?
02:39:18 Matthew Rocklin Joel: Yes, that matches the current poor state of the Dask + Tensorflow integration :(
02:39:51 Matthew Rocklin It's probably the least maintained dask integration library
02:40:21 Arnab Biswas Is it possible to use Dask for feature engineering (mostly for different Time Series related features)?
02:40:31 Matthew Rocklin Good question Deepan: Ideally Dask can find a way to load in chunks of your data, process those chunks, and then release those chunks so that it can process large datasets without keeping much of it in memory at once
02:41:01 Deepan Manoharan Thanks Matthew
02:41:10 Matthew Rocklin If that doesn't work (some workloads can't be processed in this way) then Dask will write data to disk temporarily, which reduces performance, but does let you finish (although slowly)
02:41:18 davidrpugh What is the benefit of using Dask as a parallel backend instead of the default joblib parallel backend when parallelizing scikit-learn code?
02:42:21 Jonathan MacCarthy @davidrpugh At least the dashboard :-)
02:42:21 Matthew Rocklin Arnab: Yes, people often use Dask for feature engineering, either with the Dask + Pandas integration in dask.dataframe, or with other libraries that are Dask enabled. Three examples of downstream libraries in the timeseries space are Facebook's Prophet, Blue Yonder's TSFresh, and Alteryx's FeatureTools
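A small sketch of time-series feature engineering directly in dask.dataframe; the dataset and column names are hypothetical, and timestamp is assumed to already be a datetime column.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("events/")                    # hypothetical dataset
ddf["hour"] = ddf.timestamp.dt.hour                 # calendar feature
ddf["rolling_mean"] = ddf.value.rolling(24).mean()  # windowed feature
features = ddf[["hour", "rolling_mean"]].compute()
```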
02:42:56 Matthew Rocklin David: Using Dask will let you use many machines, rather than just threads/processes on one machine
02:43:00 Matthew Rocklin And yes, the dashboard :)
02:43:04 Arnab Biswas Thank you Matthew, that answers my question.
02:43:46 Joel Branch So we should note that this “mean_squared_error” automatically returns a computed result. Is that true for everything in the metrics? I guess it makes sense, but I generally want to understand when computation is happening
02:43:49 akshay Does dask not work with columns having multiple datatypes? Is there a way to overcome this?
02:44:25 kharude Is it possible to build a pipeline using Dask?
02:46:17 Matthew Rocklin Akshay: generally Dask Dataframe follows the Pandas API (Dask dataframes are just lots of Pandas dataframes)
02:46:27 davidrpugh So if I was doing hyper-parameter tuning using scikit-learn with Dask as my parallel backend, I could access the resources of my entire Dask cluster rather than just the cores on a single machine (which is what joblib would use by default)?
02:46:34 Matthew Rocklin So you can certainly have many different columns with different dtypes.
02:46:54 Matthew Rocklin You can also have columns with complex dtypes, like tuples and dicts, but these are a bit slower in Pandas, so they're a bit slower in Dask Dataframe
02:47:17 Jacob Tomlinson davidrpugh: yes
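A minimal sketch of David's scenario: a scikit-learn grid search with joblib routed through Dask, so the search uses every worker in the cluster rather than one machine's cores. The toy data and parameter grid are illustrative.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client()  # or Client("tcp://scheduler-address:8786") for a real cluster
X, y = make_classification(n_samples=1000)

search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
with joblib.parallel_backend("dask"):  # send joblib tasks to Dask workers
    search.fit(X, y)
```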
02:47:30 akshay Okay. I was facing some issues regarding it. Is there a forum for dask I can ask such questions?
02:47:33 Matthew Rocklin Kharude: yes, by pipeline you might be referring to a few things. There are Scikit-Learn pipelines, which dask-ml supports
02:47:44 akshay Also, thank you Matthew
02:47:48 Matthew Rocklin There are also streaming analytics pipelines, for which you might take a look at streamz or dask futures
02:48:27 kharude Thanks Matthew
02:48:31 Matthew Rocklin Akshay: for general usage questions we currently recommend Stack Overflow
02:48:39 Matthew Rocklin Make sure that you tag with the #dask tag
02:48:58 Tom Baldwin What signals should I look for in the dashboard panels to know whether the computation is using my cluster efficiently? For example, if I see a lot of a certain color in the task stream / cluster map, is that bad?
02:48:58 Matthew Rocklin There is also discussion of switching over to discourse, but that's not yet as actively managed
02:49:55 akshay Awesome. Thanks a ton Matthew :)
02:50:02 Jacob Tomlinson Tom Baldwin: We will have a look at the profiler tool shortly, which can help in these situations
02:50:27 Jacob Tomlinson Also keep an eye out for red, which shows inter-worker communication, and orange, which shows swapping to disk. Both of these are bottlenecks.
02:50:29 Matthew Rocklin Ooh, great question Tom. Understanding the dashboard is critical to making sure that you're using your hardware efficiently. I probably can't answer your question entirely here, but you may want to look at https://docs.dask.org/en/latest/diagnostics-distributed.html
02:50:41 Matthew Rocklin There is also a good youtube video there that goes over things in depth
02:50:48 Matthew Rocklin Although you'll also get the same experience here in just a moment :)
02:51:25 Harshaka Perera Are any performance benchmarks available for certain sets of data sizes/types?
02:51:58 Tom Baldwin Thanks!
02:52:23 Matthew Rocklin The Dask community doesn't maintain any active benchmarking suites today. There is a nice one with Dask + RAPIDS (Jacob's group) that I think they're planning to release soonish?
02:52:51 Matthew Rocklin But in general, Dask Dataframe's performance mirrors Pandas' performance, just on larger data
02:53:12 Matthew Rocklin That's not quite true as you get to more complex operations, like sorting, but it's mostly true
02:56:05 Abdulelah Any best practices or guidelines available on selecting cluster resources (number of workers, threads, chunk size, …), especially when working in a heterogeneous cluster environment?
02:56:14 Luis Gonzalez is the dask.dataframe.read_csv in this example lazy?
02:57:27 Matthew Rocklin Abdulelah: Good question. Our general best practices are here: https://docs.dask.org/en/latest/best-practices.html But really, what you bring up is a hard question; it depends a lot on your hardware and your problem.
02:57:59 Abdulelah I see … thanks Matthew,
02:58:26 Matthew Rocklin Luis: Yes, in general everything in dask.dataframe is lazy except for head, and methods like to_csv, to_parquet
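A quick illustration of that laziness, with hypothetical files:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")  # lazy: only lists files and reads metadata
ddf = ddf[ddf.value > 0]         # still lazy
print(ddf.head())                # eager: reads just the first partition
```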
02:59:37 Luis Gonzalez Then the speed to complete the commands depends on what Dask requires (in this case read_csv) to process the files? I have a case with 20K+ files that is eating my memory and taking 1 hour+ to complete
03:00:09 Luis Gonzalez https://dask.discourse.group/t/having-large-workstation-best-dask-cluster-configuration-to-read-csv-files/38
03:00:40 Matthew Rocklin Dask does a tiny bit of work when setting up the computation. For example it has to list all of the filenames. Depending on your filesystem, listing 20k files might take a while :)
03:09:24 Jonathan MacCarthy There’s no reason to think that client.gather will return results in order, correct?
03:10:53 Matthew Rocklin Jonathan: Correct, client.gather returns results in the same structure in which the futures were given
03:11:01 Matthew Rocklin So yes, same order
03:11:14 Matthew Rocklin But you can also do things like gather a dictionary of futures
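A small sketch of both points: gather preserves the structure and order of what you pass in, whether a list or a dictionary of futures.

```python
from dask.distributed import Client

client = Client()

futures = client.map(lambda x: x ** 2, range(5))
print(client.gather(futures))  # [0, 1, 4, 9, 16], same order as submitted

d = {"a": client.submit(sum, [1, 2]), "b": client.submit(sum, [3, 4])}
print(client.gather(d))        # {'a': 3, 'b': 7}, same dict structure
```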
03:16:15 Pierre Navaro Dear Matthew, what do you have to install on the SSH workers? Do we need to clone everything, users and software?
03:16:48 Matthew Rocklin Good question Pierre: Yes, Dask will expect that the software environment is the same on all of the workers
03:17:03 Matthew Rocklin So ideally you would conda install or pip install the same environment everywhere
03:19:10 Matthew Rocklin More information on how to set up Dask here: https://docs.dask.org/en/latest/setup.html
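A minimal sketch of an SSH deployment, assuming the same conda/pip environment is already installed on every host; the hostnames and username are hypothetical.

```python
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["host0", "host1", "host2"],         # first host runs the scheduler,
    connect_options={"username": "me"},  # the rest run workers
)
client = Client(cluster)
```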
03:19:28 Jonathan MacCarthy For a heterogeneous collection of machines (using dask SSH), will Dask recognize that a particularly large machine can handle much more work compared to other machines?
03:19:39 Tom Baldwin Is there any difference between configuring a LocalCluster and passing it to the Client vs. manually passing worker counts etc. to the Client?
03:19:44 Matthew Rocklin Jonathan: Yes
03:19:48 Pierre Navaro Is it possible to use Dask coupled with HDFS?
03:20:17 Matthew Rocklin Tom: no difference. If you create a Client without a provided cluster then it will create a LocalCluster for you with those arguments
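The equivalence Matthew describes, side by side:

```python
from dask.distributed import Client, LocalCluster

# These two setups behave the same:
client = Client(n_workers=4, threads_per_worker=2)  # implicit LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)                            # explicit LocalCluster
```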
03:20:33 Akshay Thank you Jacob and Matthew. This was very beneficial
03:20:39 Jonathan MacCarthy What is the current deployment solution recommendation for a single researcher wanting a JupyterLab front-end and cluster that can do adaptive scaling (being able to add new machines) in AWS?
03:20:41 chelle gentemann Thanks Jacob & Matthew, this was great and the tutorials are a resource I'll be going to again. Thanks so much! The chat questions were really great!
03:21:00 Lorea Awesome , thank you both, for a great session :-)
03:21:01 Joel Branch Jacob and Matt (and the whole Dask team), this was GREAT! Thanks for putting this together.
03:21:01 Tom Baldwin This was wonderful, thank you both!
03:21:08 Karan Thanks, dask team!
03:21:10 Luis Gonzalez great session
03:21:13 Jonathan MacCarthy Thank you!
03:21:13 Abdulelah Thanks
03:21:16 Pierre Navaro THANK YOU
03:21:16 Joel Branch Have an awesome weekend and Father’s Day
03:21:18 roberto Thanks!
03:21:18 davidrpugh Bye thanks!
03:21:21 Arnab Biswas Thank you !
03:21:31 Arnab Biswas That was awesome!
03:21:31 Harshaka Perera Thank you!
03:21:55 Matthew Rocklin Pierre: Yes, you can use Dask with HDFS. See https://docs.dask.org/en/latest/remote-data-services.html
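A minimal sketch of reading from HDFS (needs the pyarrow package installed); the namenode address and path are hypothetical.

```python
import dask.dataframe as dd

ddf = dd.read_csv("hdfs://namenode:8020/data/*.csv")
print(ddf.value.mean().compute())
```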
03:22:03 Nick Byrne Thank you!
