@actsasgeek
Last active February 12, 2024 12:55
EN685.648 Starter Pack

EN685.648 Data Science

This course requires knowledge of Python and SQL (the requirement is listed in the course description). If you do not know Python, you will not do well and the course will be that much harder.

Instructors

For the Fall 2023 semester, three sections of Data Science are being offered. The primary and secondary instructors differ by section, and each primary instructor uses a different chat platform:

Section  Primary  Secondary  Chat Platform
81       Butcher  Stewart    Slack
82       Stewart  Butcher    Teams
83       Butcher  Stewart    Slack

Please install Slack or Teams accordingly.

Infrastructure

  1. You are very strongly encouraged to use a computer upon which you have administrator/superuser privileges. I cannot help you with problems associated with the installation of software and libraries.
  2. You are encouraged to use a "Unix"-style operating system (macOS or a Linux flavor), either directly or in a virtual environment (Docker or VirtualBox). This is not required to excel in the class, but it is helpful: the command-line examples will be in *nix, many platforms are built on Linux, and you should be comfortable on more than one OS.
  3. Ideally, you should have your environment up and running before the semester starts but no later than the 2nd day of class (that first Friday). There is a test assignment due that day.

Setup

  1. Install Anaconda for your operating system. Use the latest version.
  2. Set libmamba as the default solver: $ conda config --set solver libmamba
  3. Create a directory/folder for data science and move into it.
  4. Download environment.yml into your directory (or just copy the Raw content, paste it into a file named environment.yml, and save it).
  5. Execute conda env create -f environment.yml
  6. You now have all the libraries needed for the course (as of now).
  7. Execute conda activate en685648 (whenever you work in that environment for any reason, activate it!).
  8. Set up Jupyter notebook to use this environment: python -m ipykernel install --user --name en685648 --display-name "Python (en685648)"

For now, the only thing in this directory will be the environment.yml file.

NB: you must install the specified version of python-duckdb. Database formats between versions are not compatible.
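One way to confirm which DuckDB version actually got installed is to query the package metadata from inside the activated environment. The following is a minimal sketch, not part of the course materials: `EXPECTED_DUCKDB` is a placeholder that should be replaced with the version pinned in environment.yml, and the distribution name `duckdb` is an assumption about how the python-duckdb conda package registers itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Placeholder only: copy the real pin from environment.yml.
EXPECTED_DUCKDB = "0.0.0"


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def check_pin(installed, expected):
    """Describe whether the installed version matches the pinned one."""
    if installed is None:
        return "not installed"
    if installed == expected:
        return "ok"
    return f"mismatch: have {installed}, need {expected}"


# Assumption: the conda package python-duckdb registers itself as "duckdb".
print(check_pin(installed_version("duckdb"), EXPECTED_DUCKDB))
```

Run it with the en685648 environment activated; a "mismatch" result suggests the environment was not built from the provided environment.yml.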

If you have an error setting the solver, you have an older version of Conda. Please update.
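To see whether your Conda is old enough to cause that error, you can compare its reported version against a minimum. This is a rough sketch under one assumption: that the `solver` configuration key requires conda 22.11 or newer (`MIN_CONDA` below). Verify that threshold against the Conda release notes before relying on it.

```python
import re
import shutil
import subprocess

# Assumption: the `solver` config key needs conda 22.11 or newer.
MIN_CONDA = (22, 11)


def parse_conda_version(text):
    """Extract (major, minor, ...) from output such as 'conda 23.7.4'."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def supports_solver_setting(version_tuple, minimum=MIN_CONDA):
    """True if the parsed version is at least the assumed minimum."""
    return version_tuple is not None and version_tuple[:2] >= minimum


conda_path = shutil.which("conda")
if conda_path:
    # Ask the conda executable on PATH for its version string.
    result = subprocess.run([conda_path, "--version"], capture_output=True, text=True)
    found = parse_conda_version(result.stdout or result.stderr)
    print(found, supports_solver_setting(found))
else:
    print("conda is not on PATH")
```

If the check reports False, updating Conda and rerunning the solver command is the likely fix.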

Workflow

Once the class has started, you will be able to download the Jupyter notebooks for each module. In the interim, you may want to get a feel for the environment in which you'll be working. Use the following commands:

  1. conda activate en685648 - this will activate the environment. (Use conda env list to see your installed environments).
  2. jupyter notebook - this will start the Jupyter notebook environment with the current directory as the root.
  3. When you create a new Jupyter notebook, you can select "Python (en685648)" as the kernel.

Note - jupyter notebook will eventually be "sunsetted" in favor of jupyter lab. If you want to use jupyter lab, that's fine.

When you're done, you can invoke conda deactivate.

NB: You "must" use this Anaconda environment and the "en685648" kernel for this class. Failure to do so has consequences that are your responsibility. These consequences may include getting a zero on assignments.
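A quick sanity check you can run in the first cell of a notebook is to look at where the kernel's Python interpreter lives. The sketch below assumes the environment was created by name, so that its directory ends in `en685648` (for example `.../envs/en685648`); the helper `running_in_env` is illustrative, not something the course provides.

```python
import sys
from pathlib import PurePath

ENV_NAME = "en685648"  # environment name used in the setup steps


def running_in_env(prefix, env_name=ENV_NAME):
    """True if the interpreter prefix's last path component is the env name."""
    return PurePath(prefix).name == env_name


# sys.prefix is the root directory of the interpreter running this code.
print(sys.prefix)
print("OK" if running_in_env(sys.prefix) else f"Not running in {ENV_NAME}")
```

If the second line does not print OK, switch the notebook's kernel to "Python (en685648)" and run the cell again.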

Do not use a regular code editor for your assignments. Instead, use something that can correctly edit and display Jupyter notebooks (e.g., Jupyter Notebook, VS Code, etc.).

References

The students taking this course come from a variety of backgrounds. While the course itself covers a lot of topics, you will have an easier time of it during the semester if you do a bit of preparation on your own. These links will get you started but feel free to explore other resources using Google.

  1. Python 3
  2. Jupyter notebooks (YouTube)
    1. Jupyter Notebooks for Beginners (blog)
    2. Advanced Jupyter Notebooks (blog)
  3. Markdown
  4. Pandas
  5. Matplotlib

Codewars is a great way to increase fluency with Python. Look for idiomatic Python solutions.

This course is not primarily a coding course and Data Science is not primarily about running code. Data Science is about analysis and communication. Style, usage, and organization matter. You must be equally adept at using the Markdown and Code cells in the Jupyter notebook. If nothing else, learn to use Markdown effectively. Additionally, tabulate has been included to help with the creation of tables.
