The transcript of the Dask tutorial presented on 2020-06-19

01:36:09 Matthew Rocklin Afternoon!
01:36:18 garanews Hello there
01:36:23 Arnab Biswas Hello!
01:40:28 robbie Hi all, slightly weird question but is anyone talking right now? Just checking to see if my sound is working
01:41:15 Jonas Bostelmann Hi, nobody is talking at the moment! :-)
01:41:52 robbie Great, thanks!
01:44:31 chelle gentemann i feel like hold music should be playing.
01:45:04 Jacob Tomlinson Oooh good idea for next time!
01:45:15 chelle gentemann someone could sing...
01:46:18 Matthew Rocklin +1
01:46:35 Matthew Rocklin I would also accept witty banter
01:48:29 Jacob Tomlinson bit.ly/2Yehz4Y
01:52:50 Matthew Rocklin Hi Everyone! I'm here to help. If you have a question about the tutorial and are comfortable, please feel free to use the public chat (if you have a question, it's likely that others in the group have the same question). If you prefer to ask privately feel free to message only me. There is a "To: Everyone" drop down menu in the chat menu that you can use to select people to chat with privately.
01:53:54 Naresh Does Dask support Spark clusters?
01:55:36 Matthew Rocklin If you want to launch this binder, the direct link is here: https://mybinder.org/v2/gh/jacobtomlinson/dask-video-tutorial-2020/master?urlpath=lab
01:56:38 Arnab Biswas Resolution is still a problem for me. :-(
01:56:54 Arnab Biswas Yes
01:57:13 Petar Todorov screen is still blurry to me sorry
01:57:14 Matthew Rocklin Great question Naresh! You can launch Dask on any cluster designed for Spark/MapReduce using the Dask-Yarn project. Docs are available at yarn.dask.org
01:57:52 Matthew Rocklin For more information on Dask and Spark you may want to read through https://docs.dask.org/en/latest/spark.html
01:58:32 Naresh Thanks. I will check it out.
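A minimal sketch of what Matthew describes, assuming a packed conda environment (environment.tar.gz) and hypothetical worker sizing:

```python
# Launch Dask on a YARN/Hadoop cluster with the dask-yarn project
# (docs at yarn.dask.org). Paths and sizes here are hypothetical.
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(
    environment="environment.tar.gz",  # packed conda env shipped to workers
    worker_vcores=2,
    worker_memory="4GiB",
)
cluster.scale(4)          # ask YARN for 4 worker containers
client = Client(cluster)  # point Dask at the YARN cluster
```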
01:59:17 Petar Todorov Does Dask support pandas' query?
02:00:10 Luis Gonzalez When the CSV file or files you are trying to analyze are way bigger than your available memory, what development pattern or Dask configuration do I need to take care of in order to be able to process the data?
02:00:11 Matthew Rocklin Hi Petar! Yes, you can see a full list of Dask Dataframe's API here: https://docs.dask.org/en/latest/dataframe-api.html
02:00:19 Petar Todorov thanks
02:01:10 akshay Can I lazy load multiple csvs and then output a single csv with a subset of columns?
02:01:16 J Emmanuel Johnson So once you've called the .compute() but you haven't assigned it to a variable, is it now taking some of your memory?
02:01:46 J Emmanuel Johnson perfect, thanks!
02:01:53 Matthew Rocklin Good question Luis and Akshay. Yes, Dask dataframe is often used to read many large CSV files. dask.dataframe.read_csv supports most of the same keyword arguments as pandas.read_csv. https://docs.dask.org/en/latest/dataframe-create.html
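A minimal sketch of Akshay's use case, with hypothetical file paths and column names: lazily read many CSVs, keep a subset of columns, and write the result back out.

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/2020-*.csv")  # lazy: one pandas partition per file
subset = ddf[["timestamp", "value"]]  # still lazy: just selects columns
subset.to_csv("out/part-*.csv")       # eager: runs the computation and
                                      # writes one CSV per partition
# (pass a plain filename plus single_file=True to get one output CSV instead)
```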
02:02:10 davidrpugh How does the def detect the client?
02:02:21 akshay Awesome, thanks Matthew :)
02:02:47 davidrpugh Sorry, how does the ddf detect the underlying client? What happens if you don’t manually specify a client?
02:02:57 Matthew Rocklin Tricky question David :) When you set up a Client it modifies the global Dask scheduler that is used by default
02:03:38 Matthew Rocklin If you don't create a client then Dask uses a local threadpool
02:04:18 Matthew Rocklin There is more information about how Dask handles different schedulers like the dask.distributed.Client and a local thread pool here: https://docs.dask.org/en/latest/scheduling.html
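A small sketch of that scheduler behaviour, with hypothetical data files: without a Client, .compute() runs on the local thread pool; once a Client exists, it becomes the global default.

```python
import dask.dataframe as dd
from dask.distributed import Client

ddf = dd.read_csv("data/*.csv")

ddf.value.mean().compute()  # no Client yet: runs on the local thread pool

client = Client()           # starts a local distributed cluster and registers
                            # itself as the default scheduler
ddf.value.mean().compute()  # now runs on the distributed cluster
```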
02:04:36 Joel Branch Just curious - what’s the largest known Dask cluster being used in production? Even just an estimate would be cool
02:05:55 akshay Absolutely love RAPIDS
02:06:00 Lorea what do I need to install to see the Performance diagnostic boards?
02:06:29 Matthew Rocklin Hi Joel! In practice most problems can be resolved with 10-100 machines. That being said, we do see people using 1000 machines in some cases. Dask appears to break down around the 10000 machine level. There is a good issue on this showing Dask failing to use Summit, the world's fastest supercomputer: dask/distributed#3691
02:06:55 Joel Branch Thanks sir!
02:07:13 Matthew Rocklin Hi Lorea! For this tutorial we're using Binder, which should handle the setup for you. https://mybinder.org/v2/gh/jacobtomlinson/dask-video-tutorial-2020/master?urlpath=lab
02:07:46 Matthew Rocklin But if you're asking about local use, Dask will run the performance dashboard by default if you have installed the bokeh package.
02:07:56 Matthew Rocklin This is hosted at http://localhost:8787/status by default
02:08:15 Lorea Thank you Matthew, I'm using my own environment on the cloud (cloned the repo you sent) and would like to build the boards there
02:08:37 Matthew Rocklin If you want the dashboard to be integrated into JupyterLab then you should take a look at the dask-labextension package: https://github.com/dask/dask-labextension
02:08:45 Matthew Rocklin There is a video there that should help you set things up
02:08:53 J Emmanuel Johnson A side comment: most use cases I've seen are in JupyterLab. Can you link some examples where Dask is inside an ETL framework using scripts? Including spawning and closing the distributed clients as needed.
02:08:55 Lorea great ! thanx
02:10:21 Matthew Rocklin Good point J Emmanuel. There isn't much difference between using Dask in a notebook and running Dask in a script. We tend to use notebooks just because they show outputs nicely.
02:10:33 Matthew Rocklin Is there a part of working in scripts that you're curious about in particular?
02:12:16 Matthew Rocklin I'm curious, how many people have managed to launch Dask with the Client() constructor and can see the dashboards light up on the side?
02:12:31 Matthew Rocklin Or more importantly, how many people can't get this to work?
02:12:35 J Emmanuel Johnson I just want some best practices from experienced programmers for opening and closing clients as needed. Too many times I've caught myself with open clients or too many spawned processes, etc.
02:13:15 Matthew Rocklin There is a little "More ..." option in the chat. Maybe a Thumbs Up if you're doing well and a Thumbs Down if you're still having trouble?
02:14:54 davidrpugh Getting an error trying to open the file “ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 2”
02:15:31 roberto you should use sep=" "
02:15:41 Karan Sorry, where can I find the go faster button?
02:16:06 roberto I can't find the faster button
02:16:17 gennadiy.donchyts@gmail.com is there a faster button? :)
02:16:50 gennadiy.donchyts@gmail.com yes
02:16:52 gennadiy.donchyts@gmail.com thx
02:16:58 Robbie Found it!
02:17:01 Luis Gonzalez thanks!
02:18:34 Matthew Rocklin That was fun everyone. Thanks!
02:24:42 Matthew Rocklin Ooh, fancy
02:24:45 Luis Gonzalez Side question about the JupyterLab extension: in case I already have another cluster running, can I connect to the dashboard using the extension? Or does it only work for clusters created with the extension?
02:25:16 akshay How do I move from jupyter notebook format to jupyterlab?
02:26:38 Matthew Rocklin Luis: You can connect the JupyterLab extension to any running Dask cluster by placing the address of the dashboard into the top bar of the Dask sidebar. You can also press the Magnifying Glass icon to have Dask try to search the local notebook for your dashboard. Specifying the dashboard address manually is more robust though.
02:26:57 Konstantin I think substitute 'tree' with 'lab' in the URL
02:26:58 Jonas Bostelmann Change "tree" to "lab" in the URL
02:27:12 akshay Thank you :)
02:27:26 Luis Gonzalez thanks Matthew will try it
02:27:34 Matthew Rocklin Akshay: on disk they're both the same kind of file, so if you have notebooks today you can use them with JupyterLab without reformatting them
02:28:26 Matthew Rocklin Just run jupyter lab rather than jupyter notebook
02:28:28 Luis Gonzalez Will a recording of this session be available?
02:29:11 Matthew Rocklin That's the intention :)
02:29:16 Deepan Manoharan What is the largest Dask deployment in production as of today? Do we have references/case studies?
02:29:29 Matthew Rocklin We're recording now, but this is our first time recording with Zoom. Hopefully everything works out well :)
02:29:48 Luis Gonzalez great!
02:30:18 Matthew Rocklin Hi Deepan! In practice most problems can be resolved with 10-100 machines. That being said, we do see people using 1000 machines in some cases. Dask appears to break down around the 10000 machine level. There is a good issue on this showing Dask failing to use Summit, the world's fastest supercomputer: dask/distributed#3691
02:31:09 Deepan Manoharan Thanks.
02:32:02 Matthew Rocklin Following along with Jacob's talk, here is the Dask Array API page: https://docs.dask.org/en/latest/array-api.html
02:32:57 chelle gentemann How should we use .visualize()? What are we looking for?
02:33:31 Matthew Rocklin At this point I think that we're using .visualize() mostly to build intuition around how Dask breaks down commands into task graphs
02:33:54 roberto Hi Matthew, what about distributing user-defined functions within the cluster? Is it enough to define them inside the notebook, or should we somehow pack them inside a Docker image?
02:34:20 Matthew Rocklin But in practice visualize can also help you understand a bit about your computation. Is it embarrassingly parallel? Are there bottlenecks? And so on. More on understanding performance is here: https://docs.dask.org/en/latest/understanding-performance.html
02:34:29 Jonathan MacCarthy @chelle, I’ve used it to see where I might be able to break up a large graph into smaller batches, if I’m ever having trouble computing a naively large graph.
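A tiny example of using .visualize() to build that intuition (rendering the graph requires the graphviz library):

```python
import dask.array as da

x = da.ones((1000, 1000), chunks=(250, 250))  # 16 chunks
y = (x + x.T).sum(axis=0)
y.visualize(filename="graph.png")  # draws the task graph; computes nothing
```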
02:34:39 roberto I'm thinking in particular about this https://github.com/rsignell-usgs/sagemaker-fargate-test/blob/master/README.md
02:35:24 Matthew Rocklin Great question Roberto: for small functions defined in a notebook Dask will handle it. If you have a local script that you want to move around, see the Client.upload_file method. For larger software packages though, yes, you should distribute those packages as part of the software environment that you ship to workers, for example with Docker
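A minimal sketch of the middle case, with a hypothetical helpers.py module and a hypothetical clean function inside it:

```python
from dask.distributed import Client
import helpers  # our local module; hypothetical

client = Client("tcp://scheduler-address:8786")  # hypothetical address
client.upload_file("helpers.py")  # ship the module to every worker

futures = client.map(helpers.clean, ["a.csv", "b.csv"])  # runs on workers
results = client.gather(futures)
```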
02:37:09 Matthew Rocklin Dask-ML docs are at https://ml.dask.org
02:38:12 roberto thank you!
02:38:43 Joel Branch Just fyi - the “dark + tensor flow” link is broken
02:38:53 Joel Branch “Dask” + tensor flow :-)
02:38:53 Deepan Manoharan Hi Matthew, if the data doesn't fit in the memory of the machines in the cluster, how does Dask handle the data in such a situation?
02:39:18 Matthew Rocklin Joel: Yes, that matches the current poor state of the Dask + Tensorflow integration :(
02:39:51 Matthew Rocklin It's probably the least maintained dask integration library
02:40:21 Arnab Biswas Is it possible to use Dask for feature engineering (mostly for different Time Series related features)?
02:40:31 Matthew Rocklin Good question Deepan: Ideally Dask can find a way to load in chunks of your data, process those chunks, and then release those chunks so that it can process large datasets without keeping much of it in memory at once
02:41:01 Deepan Manoharan Thanks Matthew
02:41:10 Matthew Rocklin If that doesn't work (some workloads can't be processed in this way) then Dask will write data to disk temporarily, which reduces performance, but does let you finish (although slowly)
02:41:18 davidrpugh What is the benefit of using Dask as a parallel backend instead of the default joblib parallel backend when parallelizing scikit-learn code?
02:42:21 Jonathan MacCarthy @davidrpugh At least the dashboard :-)
02:42:21 Matthew Rocklin Arnab: Yes, people often use Dask for feature engineering, either with the Dask + Pandas integration in dask.dataframe, or with other libraries that are Dask enabled. Three examples of downstream libraries in the timeseries space are Facebook's Prophet, Blue Yonder's TSFresh, and Alteryx's FeatureTools
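A small sketch of time-series feature engineering directly in dask.dataframe; the dataset and column names are hypothetical, and timestamp is assumed to already be a datetime column.

```python
import dask.dataframe as dd

ddf = dd.read_parquet("events/")                    # hypothetical dataset
ddf["hour"] = ddf.timestamp.dt.hour                 # calendar feature
ddf["rolling_mean"] = ddf.value.rolling(24).mean()  # windowed feature
features = ddf[["hour", "rolling_mean"]].compute()
```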
02:42:56 Matthew Rocklin David: Using Dask will let you use many machines, rather than just threads/processes on one machine
02:43:00 Matthew Rocklin And yes, the dashboard :)
02:43:04 Arnab Biswas Thank you Matthew, that answers my question.
02:43:46 Joel Branch So we should note that this “mean_squared_error” automatically returns a computed result. Is that true for everything in the metrics? I guess it makes sense, but I generally want to understand when computation is happening
02:43:49 akshay Does dask not work with columns having multiple datatypes? Is there a way to overcome this?
02:44:25 kharude Is it possible to build a pipeline using Dask?
02:46:17 Matthew Rocklin Akshay: generally Dask Dataframe follows the Pandas API (Dask dataframes are just lots of Pandas dataframes)
02:46:27 davidrpugh So if I was doing hyper-parameter tuning using scikit-learn with Dask as my parallel backend, I could access the resources of my entire Dask cluster rather than just the cores on a single machine (which is what joblib would use by default)?
02:46:34 Matthew Rocklin So you can certainly have many different columns with different dtypes.
02:46:54 Matthew Rocklin You can also have columns with complex dtypes, like tuples and dicts, but these are a bit slower in Pandas, so they're a bit slower in Dask Dataframe
02:47:17 Jacob Tomlinson davidrpugh: yes
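A minimal sketch of David's scenario: a scikit-learn grid search with joblib routed through Dask, so the search uses every worker in the cluster rather than one machine's cores. The toy data and parameter grid are illustrative.

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

client = Client()  # or Client("tcp://scheduler-address:8786") for a real cluster
X, y = make_classification(n_samples=1000)

search = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]}, cv=3)
with joblib.parallel_backend("dask"):  # send joblib tasks to Dask workers
    search.fit(X, y)
```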
02:47:30 akshay Okay. I was facing some issues regarding it. Is there a forum for dask I can ask such questions?
02:47:33 Matthew Rocklin Kharude: yes, by pipeline you might be referring to a few things. There are Scikit-Learn pipelines, which dask-ml supports
02:47:44 akshay Also, thank you Matthew
02:47:48 Matthew Rocklin There are also streaming analytics pipelines, for which you might take a look at streamz or dask futures
02:48:27 kharude Thanks Matthew
02:48:31 Matthew Rocklin Akshay: for general usage questions we currently recommend Stack Overflow
02:48:39 Matthew Rocklin Make sure that you tag with the #dask tag
02:48:58 Tom Baldwin What signals should I look for in the dashboard panels to know whether the computation is using my cluster efficiently? For example, if I see a lot of a certain color in the task stream / cluster map, is that bad?
02:48:58 Matthew Rocklin There is also discussion of switching over to discourse, but that's not yet as actively managed
02:49:55 akshay Awesome. Thanks a ton Matthew :)
02:50:02 Jacob Tomlinson Tom Baldwin: We will have a look at the profiler tool shortly, which can help in these situations
02:50:27 Jacob Tomlinson Also keep an eye out for red, which shows inter-worker communication, and orange, which shows swapping to disk. Both of these are bottlenecks.
02:50:29 Matthew Rocklin Ooh, great question Tom. Understanding the dashboard is critical to making sure that you're using your hardware efficiently. I probably can't answer your question entirely here, but you may want to look at https://docs.dask.org/en/latest/diagnostics-distributed.html
02:50:41 Matthew Rocklin There is also a good youtube video there that goes over things in depth
02:50:48 Matthew Rocklin Although you'll also get the same experience here in just a moment :)
02:51:25 Harshaka Perera Are any performance benchmarks available for certain sets of data sizes/types?
02:51:58 Tom Baldwin Thanks!
02:52:23 Matthew Rocklin The Dask community doesn't maintain any active benchmarking suites today. There is a nice one with Dask + RAPIDS (Jacob's group) that I think they're planning to release soonish?
02:52:51 Matthew Rocklin But in general, Dask Dataframe's performance mirrors Pandas' performance, just on larger data
02:53:12 Matthew Rocklin That's not quite true as you get to more complex operations, like sorting, but it's mostly true
02:56:05 Abdulelah Any best practices or guidelines available on selecting cluster resources (number of workers, threads, chunk size, …), especially when working in a heterogeneous cluster environment?
02:56:14 Luis Gonzalez is the dask.dataframe.read_csv in this example lazy?
02:57:27 Matthew Rocklin Abdulelah: Good question. Our general best practices are here: https://docs.dask.org/en/latest/best-practices.html But really, what you bring up is a hard question; it depends a lot on your hardware and your problem.
02:57:59 Abdulelah I see … thanks Matthew,
02:58:26 Matthew Rocklin Luis: Yes, in general everything in dask.dataframe is lazy except for head, and methods like to_csv, to_parquet
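A quick illustration of that laziness, with hypothetical files:

```python
import dask.dataframe as dd

ddf = dd.read_csv("data/*.csv")  # lazy: only lists files and reads metadata
ddf = ddf[ddf.value > 0]         # still lazy
print(ddf.head())                # eager: reads just the first partition
```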
02:59:37 Luis Gonzalez Then the speed to complete the commands depends on what Dask requires (in this case read_csv) to process the files? I have a case with 20K+ files that is eating my memory and taking 1 hour+ to complete
03:00:09 Luis Gonzalez https://dask.discourse.group/t/having-large-workstation-best-dask-cluster-configuration-to-read-csv-files/38
03:00:40 Matthew Rocklin Dask does a tiny bit of work when setting up the computation. For example it has to list all of the filenames. Depending on your filesystem, listing 20k files might take a while :)
03:09:24 Jonathan MacCarthy There’s no reason to think that client.gather will return results in order, correct?
03:10:53 Matthew Rocklin Jonathan: Correct, client.gather returns results in the same structure in which the futures were given
03:11:01 Matthew Rocklin So yes, same order
03:11:14 Matthew Rocklin But you can also do things like gather a dictionary of futures
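A small sketch of both points: gather preserves the structure and order of what you pass in, whether a list or a dictionary of futures.

```python
from dask.distributed import Client

client = Client()

futures = client.map(lambda x: x ** 2, range(5))
print(client.gather(futures))  # [0, 1, 4, 9, 16], same order as submitted

d = {"a": client.submit(sum, [1, 2]), "b": client.submit(sum, [3, 4])}
print(client.gather(d))        # {'a': 3, 'b': 7}, same dict structure
```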
03:16:15 Pierre Navaro Dear Matthew, what do you have to install on the SSH workers? Do we need to clone everything, users and software?
03:16:48 Matthew Rocklin Good question Pierre: Yes, Dask will expect that the software environment is the same on all of the workers
03:17:03 Matthew Rocklin So ideally you would conda install or pip install the same environment everywhere
03:19:10 Matthew Rocklin More information on how to set up Dask here: https://docs.dask.org/en/latest/setup.html
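A minimal sketch of an SSH deployment, assuming the same conda/pip environment is already installed on every host; the hostnames and username are hypothetical.

```python
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["host0", "host1", "host2"],         # first host runs the scheduler,
    connect_options={"username": "me"},  # the rest run workers
)
client = Client(cluster)
```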
03:19:28 Jonathan MacCarthy For a heterogeneous collection of machines (using dask SSH), will Dask recognize that a particularly large machine can handle much more work compared to other machines?
03:19:39 Tom Baldwin Is there any difference between configuring a LocalCluster and passing it to the Client vs. manually passing worker counts etc. to the Client?
03:19:44 Matthew Rocklin Jonathan: Yes
03:19:48 Pierre Navaro Is it possible to use Dask coupled with HDFS?
03:20:17 Matthew Rocklin Tom: no difference. If you create a Client without a provided cluster then it will create a LocalCluster for you with those arguments
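The equivalence Matthew describes, side by side:

```python
from dask.distributed import Client, LocalCluster

# These two setups behave the same:
client = Client(n_workers=4, threads_per_worker=2)  # implicit LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=2)
client = Client(cluster)                            # explicit LocalCluster
```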
03:20:33 Akshay Thank you Jacob and Matthew. This was very beneficial
03:20:39 Jonathan MacCarthy What is the current deployment solution recommendation for a single researcher wanting a JupyterLab front-end and cluster that can do adaptive scaling (being able to add new machines) in AWS?
03:20:41 chelle gentemann Thanks Jacob & Matthew, this was great and the tutorials are a resource I'll be going to again. Thanks so much! The chat questions were really great!
03:21:00 Lorea Awesome , thank you both, for a great session :-)
03:21:01 Joel Branch Jacob and Matt (and the whole Dask team), this was GREAT! Thanks for putting this together.
03:21:01 Tom Baldwin This was wonderful, thank you both!
03:21:08 Karan Thanks, dask team!
03:21:10 Luis Gonzalez great session
03:21:13 Jonathan MacCarthy Thank you!
03:21:13 Abdulelah Thanks
03:21:16 Pierre Navaro THANK YOU
03:21:16 Joel Branch Have an awesome weekend and Father’s Day
03:21:18 roberto Thanks!
03:21:18 davidrpugh Bye thanks!
03:21:21 Arnab Biswas Thank you !
03:21:31 Arnab Biswas That was awesome!
03:21:31 Harshaka Perera Thank you!
03:21:55 Matthew Rocklin Pierre: Yes, you can use Dask with HDFS. See https://docs.dask.org/en/latest/remote-data-services.html
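A minimal sketch of reading from HDFS (needs the pyarrow package installed); the namenode address and path are hypothetical.

```python
import dask.dataframe as dd

ddf = dd.read_csv("hdfs://namenode:8020/data/*.csv")
print(ddf.value.mean().compute())
```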
03:22:03 Nick Byrne Thank you!
