<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: RCKettel</title>
    <description>The latest articles on Forem by RCKettel (@rckettel).</description>
    <link>https://forem.com/rckettel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F465275%2F358dc04f-9ee5-490c-ac34-1d7322ad473c.png</url>
      <title>Forem: RCKettel</title>
      <link>https://forem.com/rckettel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/rckettel"/>
    <language>en</language>
    <item>
      <title>A discussion of the history of DS methodologies</title>
      <dc:creator>RCKettel</dc:creator>
      <pubDate>Mon, 16 Nov 2020 19:33:44 +0000</pubDate>
      <link>https://forem.com/rckettel/a-discussion-of-the-history-of-ds-methodologies-516</link>
      <guid>https://forem.com/rckettel/a-discussion-of-the-history-of-ds-methodologies-516</guid>
      <description>&lt;p&gt;I was curious about the history of data science and so looked into the influence data science has had on other schools of thought.  While skimming through a list of scientific papers, I found one that was of interest to me so I thought I might discuss it and its contents.  The paper is called Big Data, New Epistemologies, and Paradigm Shifts.  In it the author is responding to major changes Big Data methodologies have caused in the social sciences and the humanities. He specifically speaks to the new methods that were put forth by practitioners of data science and how these methods and epistemologies affect the fields and their studies.&lt;/p&gt;

&lt;p&gt;The first major subject discussed is the set of claims made by data scientists at the time: that Big Data was creating a new age of study in which the volume of data and the methods used allow the data to explain its own trends, without recourse to methods specific to any one social science.  Data science is not unique to any single discipline; it is applied to non-scientific questions and trends as readily as to natural phenomena and natural processes.  Because of this, there was (and possibly still is) a culture among data scientists of claiming that the scientific method and the empirical processes of any single scientific school, such as models or hypotheses, are unnecessary when studying large datasets: the framing is held to be exhaustively holistic, correlations and trends need no explanation or further exploration, and the data will show no inherent bias.  On this view, the patterns and trends are meaningful and factual in their own right, without human intervention, and anyone who can read a model can interpret them.&lt;/p&gt;

&lt;p&gt;The author asserts these beliefs are misguided.  First, any dataset is a sample of a larger whole, so it cannot be holistic, nor can it provide inherently general or completely factual information: many factors involved in collecting the data introduce bias, and data is rarely collected without a given purpose in mind.  Second, the algorithms in use were developed and created to answer questions based on philosophical constructs, such as testing a hypothesis.  Thus, no algorithm used in data science was created without some sense of scientific purpose, and each was put through extensive scientific rigor to prove its reliability; as such, they are scientific models.  Third, the explanation of data will always be framed by some question, and the algorithms used for analysis were created for a particular approach to answering hypotheses, with results interpretable only in light of those questions.  This particular scientific approach introduces further bias.  In addition, patterns illuminated by the algorithms are not inherently meaningful and can be random or spurious.  Finally, the claim that knowledge of a particular field is unnecessary to interpret results is unrealistic.  Data can be interpreted without prior knowledge, but the findings will likely be weaker when placed in a greater context.  The reality is that, though these claims may hold for practitioners of data science outside the scientific community, they oversimplify a more complex approach in order to prove the value of Big Data analytics.&lt;/p&gt;

&lt;p&gt;At this point the author asserts the need for a distinct epistemological process.  Data science, he states, uses a different method to define hypotheses within empirical methods, applying guided techniques to identify potential questions worth examining.  He explains that people who use Big Data methodologies did not follow the traditional empirical approach.  Rather, meaningful data were generated using preplanned approaches to harvest data, and the material was then strategically evaluated to identify information worth further inquiry for people in a given field.  In other words, the data collection and evaluation used abductive reasoning, finding data through a logical method but not a definitive one.  Information and relationships in the data were then used to form hypotheses through induction, by studying the relationships that became apparent in the data and building a hypothesis on those insights.  Both, he points out, differ from the usual scientific method. &lt;/p&gt;

&lt;p&gt;The author argues this new model will become the paradigm of the age of Big Data, as it is better suited than the traditional model to extracting meaning from large datasets.  This has been especially true, and most challenging, in fields like the humanities, where emerging disciplines such as the Digital Humanities and the Computational Social Sciences have challenged scholars&amp;#39; traditional methods.  Both fields have been met with similar resistance from more traditional scholars, since the methods used by practitioners of these schools of thought may yield results that lack depth and context relative to the existing knowledge base, and require no real domain knowledge to be understood.  Specifically, critics argue these studies are too reductionist: by relying only on quantitative systems of explanation they minimize humanity&amp;#39;s role in the studies and ignore the greater complexities of society.  The author defends the new schools somewhat, stating that although these methods have problems, highlighting their limitations should demonstrate their usefulness, even if only in reference to other studies that give the patterns context.&lt;/p&gt;

&lt;p&gt;The model the author finds most promising draws on radical statistics and GIS data.  Its practitioners use current social theory to determine the empirical approach and how to frame the eventual results.  It accepts research as a system subject to human influence that places information in a particular frame of reference, and it acknowledges that the researcher needs prior grounding in the subject matter.  It situates the research in a greater context, and the data retains the original context of the work.  In short, the method accepts the necessary reflexivity, or self-examination, in its practice.  &lt;/p&gt;

&lt;p&gt;The author concludes that, though Big Data is a disruptive influence and is likely to change the practical methods used by many in the scientific community, the historic methods will never really be replaced.  They will, however, remain in conflict with the new data-driven empirical methods until there is a new theoretical framework for the emerging paradigm.  He also points out that certain methods show promise, drawing attention back to the humanities and the challenge Big Data poses by claiming no knowledge base is necessary, while addressing the epistemology of models that are reflexive and situated in the realities of the scientific community.  For these reasons the author calls for an examination of the methods used in Big Data analytics in light of their effects on the changing landscape.&lt;/p&gt;

&lt;p&gt;I find this paper helpful and difficult at the same time.  It points out a need for my own understanding of how best to practice my craft.   While I would like to say that I can do whatever I want with the data presented to me, it is good to remember the guidelines I should pay attention to, such as: what was the dataset originally created for, and will it be a good representation of the knowledge that suits my purpose?  Keeping this in mind will help me avoid unexpected biases that could influence the outcome of my analysis.  In addition, remembering that I have influence over how the data is framed and what I am using it to discuss is equally important, as I could introduce unnecessary constraints or make decisions with unintended consequences.  Finally, this article made a good point that the data should be something the data scientist has grounding in.  This allows the analyst to give greater depth and understanding to their study and to reach less broad conclusions.  I hope this article helps the reader better understand the importance of basic methodologies.  Though they can be difficult, they were created for a particular purpose, and I hope the article helps to illustrate the reasoning behind them.&lt;/p&gt;

&lt;p&gt;If you found this subject interesting, a link to the original article can be found here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://journals.sagepub.com/doi/pdf/10.1177/2053951714528481"&gt;Big Data, New epistemologies, and Paradigm Shifts&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>A brief introduction to Seaborn</title>
      <dc:creator>RCKettel</dc:creator>
      <pubDate>Sun, 18 Oct 2020 11:26:02 +0000</pubDate>
      <link>https://forem.com/rckettel/a-brief-introduction-to-seaborn-32c6</link>
      <guid>https://forem.com/rckettel/a-brief-introduction-to-seaborn-32c6</guid>
      <description>&lt;p&gt;Seaborn is a graphing tool that is used within python as a means to display and interpret data.    For any data scientist who is going to be displaying their findings or simply making a presentation, a graph that is appealing to the eye and is easy to understand is ideal.  While Matplotlib is a useful tool that will help with exploring data and finding relationships it isn’t as easily customizable and doesn’t have as many methods to make the graphs created with it look as presentable.  Seaborn can be used for many of the same purposes and since it was built to be a supplement to Matplotlib much of the code necessary to make plots using its library is very similar.  In this tutorial we will go over how to make a basic graph and some basic customizations that can be used to make graphs look nicer such as changing x and y labels, adding a title, changing the orientation of the x-ticks, adding a built in style, and removing the error bars from the graph.&lt;/p&gt;

&lt;p&gt;The data for this tutorial is taken from a dataset called Pokemon With Stats.  It contains data on 721 Pokemon up to the sixth generation, including names, numbers, types, basic stats, the generation each Pokemon is from, and whether or not it is Legendary.  I chose this particular dataset because its subject matter is well known, so it will be easy to relate to even for someone with only a passing exposure to it.&lt;br&gt;
The libraries that will be used are Matplotlib and Seaborn, so they should be imported before the graph is created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
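&lt;p&gt;The dataset itself also needs to be loaded into a DataFrame before any plotting.  As a minimal sketch (the filename Pokemon.csv and the tiny stand-in table below are illustrative assumptions, not the real Kaggle data), this might look like:&lt;/p&gt;

```python
import pandas as pd

# In practice the Kaggle CSV would be read directly, e.g.:
# pokemon = pd.read_csv('Pokemon.csv')
# A tiny hypothetical stand-in with the two columns used in this tutorial:
pokemon = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Water', 'Grass'],
    'Defense': [49, 43, 65, 63],
})

# Seaborn's barplot will show the mean Defense per 'Type 1' category
print(pokemon.groupby('Type 1')['Defense'].mean())
```

Each bar in the plots below is an estimate of the mean Defense for one type, which is also why error bars appear on the bars later in the tutorial.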



&lt;p&gt;We will begin with a basic barplot by entering the code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;which will create a graph that looks like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5h43zbm359lqmqbnrqle.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5h43zbm359lqmqbnrqle.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though this graph has its flaws, the bars already appear more pleasing to the eye without our needing to insert x or y labels or do anything special to the colors.  Still, there is a lot wrong with this graph: there is no title, the x-axis and y-axis labels should be changed, the x-axis tick labels are unreadable, there is a lot of whitespace in the plot&amp;#39;s background, and what are those black lines at the top of each of the bars?  All of this will be changed.  First, we will add a title.  There are two ways to do this: one is to use &lt;code&gt;.set(title=...)&lt;/code&gt; after the final parenthesis in the above code; the other uses code that should already be familiar to anyone who knows Matplotlib.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pokemon Defense by Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense by Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In either case the graph should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb225v1ruqiu7hbme99ak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb225v1ruqiu7hbme99ak.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the title has been added, the x and y axis labels should be changed to better communicate what information is portrayed in the graph.  This can be done by adding keyword arguments to the &lt;code&gt;.set()&lt;/code&gt; method or by using the Matplotlib-style code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# The .set() method adds the labels after the title
sns.barplot('Type 1', 'Defense', data=pokemon).set(title='Pokemon Defense by Type', xlabel='Type', ylabel='Defense Statistic');

# Matplotlib takes multiple lines of code
sns.barplot('Type 1', 'Defense', data=pokemon)
plt.title('Pokemon Defense by Type')
plt.xlabel('Type')
plt.ylabel('Defense Statistic');
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftu9f895wccjiqkrr1nv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftu9f895wccjiqkrr1nv2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might have guessed, the &lt;code&gt;.set()&lt;/code&gt; method can be used just as well as the Matplotlib &lt;code&gt;plt&lt;/code&gt; functions, each to the coder&amp;#39;s own preference.  One writes horizontally and can run a little long; the other writes vertically and needs more lines of code.  Since the Seaborn library is built to work with Matplotlib, some of the more complicated changes are often done with the latter style of code.  For instance, up to this point all of the tick labels on the x-axis have been unreadable.  This can be fixed by rotating them to a desired angle, anywhere from 0 to 360 degrees, using the rotation parameter of the xticks function:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;sns.barplot('Type 1', 'Defense', data=pokemon)
plt.title('Pokemon Defense by Type')
plt.xlabel('Type')
plt.ylabel('Defense Statistic')
plt.xticks(rotation=45);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flupezoq68rd7wacpmma2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flupezoq68rd7wacpmma2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point the graph could be used in a presentation and would be easy to understand.  Yet there are some other changes that could refine the bar graph further.  The background of the graph contains a lot of white space that can be filled in by using one of Seaborn&amp;#39;s built-in styles: darkgrid, whitegrid, and dark.  There are two other styles, white and ticks, which help make the colors or tick marks more prominent, but for the purposes of this tutorial whitegrid will be demonstrated.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# set_style should be run before the initial plot code
sns.set_style('whitegrid')
sns.barplot('Type 1', 'Defense', data=pokemon)
plt.title('Pokemon Defense by Type')
plt.xlabel('Type')
plt.ylabel('Defense Statistic')
plt.xticks(rotation=45);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frolioluemd5o2rfl3n5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frolioluemd5o2rfl3n5g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, the odd black lines in the bar graph are called error bars, and they show the confidence interval for each group being graphed.  These are statistical estimates of the range in which the mean of the population should land: the smaller the range, the more precise the estimate.  If they are not necessary for the graph, or just seem unsightly, they can be removed by setting the &lt;code&gt;ci&lt;/code&gt; parameter of &lt;code&gt;sns.barplot&lt;/code&gt; to &lt;code&gt;None&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_style&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;whitegrid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type 1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pokemon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pokemon Defense by Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Defense Statistic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xticks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rotation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdh53yeohanxgn5gqi7kn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdh53yeohanxgn5gqi7kn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that the graph looks ready for presentation, it should be scaled to match.  There are four scales built into Seaborn, and they are appropriately named: in order of size they are paper, notebook, talk, and poster, the default being notebook.  Since this graph is going to be used in a talk, the scaling should reflect that.  Remember that this affects the entire graph, so &lt;code&gt;sns.set_context&lt;/code&gt; should be called before &lt;code&gt;sns.barplot&lt;/code&gt;, and because the size of the text changes, the rotation of the x-ticks might need to be adjusted for readability. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;sns.set_style('whitegrid')
sns.set_context('talk')
sns.barplot('Type 1', 'Defense', data=pokemon, ci=None)
plt.title('Pokemon Defense by Type')
plt.xlabel('Type')
plt.ylabel('Defense Statistic')
plt.xticks(rotation=65);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6z87jqfivlxcv103aotv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6z87jqfivlxcv103aotv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As was previously stated, programmers of all types need to produce professional and intuitive visualizations in their presentations.  Seaborn is an easy complement to Matplotlib that allows the user to make these presentations with less effort.  In this tutorial we examined some simple changes that make a graph look much more pleasing to the eye in only a few steps.  There are many other features and types of graphs available in Seaborn, so the aspiring programmer is encouraged to become familiar with this library.&lt;/p&gt;
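&lt;p&gt;Once a graph is scaled for a talk, it can also be exported as an image for a slide deck.  As a minimal sketch (the stand-in bar values, the output filename, and the use of the non-interactive Agg backend are assumptions for illustration):&lt;/p&gt;

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt

# A simple stand-in plot; in the tutorial this would be the Seaborn barplot
plt.bar(['Grass', 'Fire', 'Water'], [56, 43, 65])
plt.title('Pokemon Defense by Type')

# dpi controls resolution; bbox_inches='tight' trims excess margin for slides
plt.savefig('defense_by_type.png', dpi=150, bbox_inches='tight')
```

Calling &lt;code&gt;plt.savefig&lt;/code&gt; before &lt;code&gt;plt.show&lt;/code&gt; (or instead of it) ensures the figure is written out rather than cleared.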

&lt;p&gt;Links&lt;br&gt;
The Pokemon dataset:&lt;br&gt;
&lt;a href="https://www.kaggle.com/abcsds/pokemon" rel="noopener noreferrer"&gt;https://www.kaggle.com/abcsds/pokemon&lt;/a&gt;&lt;br&gt;
Seaborn documentation on barplots:&lt;br&gt;
&lt;a href="https://seaborn.pydata.org/generated/seaborn.barplot.html" rel="noopener noreferrer"&gt;https://seaborn.pydata.org/generated/seaborn.barplot.html&lt;/a&gt;&lt;br&gt;
Matplotlib documentation on the Axes class:&lt;br&gt;
&lt;a href="https://matplotlib.org/api/axes_api.html#axis-labels-title-and-legend" rel="noopener noreferrer"&gt;https://matplotlib.org/api/axes_api.html#axis-labels-title-and-legend&lt;/a&gt;&lt;br&gt;
Seaborn Styling tutorial:&lt;br&gt;
&lt;a href="https://www.codecademy.com/articles/seaborn-design-i#:%7E:text=Seaborn%20has%20five%20built%2Din,better%20suit%20your%20presentation%20needs" rel="noopener noreferrer"&gt;https://www.codecademy.com/articles/seaborn-design-i#:~:text=Seaborn%20has%20five%20built%2Din,better%20suit%20your%20presentation%20needs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
    </item>
    <item>
      <title>Apache Spark in English</title>
      <dc:creator>RCKettel</dc:creator>
      <pubDate>Sun, 27 Sep 2020 04:54:50 +0000</pubDate>
      <link>https://forem.com/rckettel/apache-spark-in-english-2h5m</link>
      <guid>https://forem.com/rckettel/apache-spark-in-english-2h5m</guid>
      <description>&lt;p&gt;For an assignment I was asked to choose a datas science library or tool and write about it so I chose Apache Spark after a friend who has been helping me here and there mentioned it to me.  &lt;/p&gt;

&lt;p&gt;Apache Spark is an open-source, extremely fast cluster computing system for large-scale data processing, meant in general for batch-based and real-time processing[1].  A cluster computing system uses many machines on a network to accomplish computing tasks with pooled resources.  This makes the system far more flexible in the amount of processing resources available, since computers can be added and removed at any time, giving the cluster a greater pool of resources.  If one of the computers in the cluster fails, the processing power of the others picks up the slack and the process keeps running[7].   This, as you might imagine, makes these systems capable of processing large amounts of data at once, whether a varying number of files or records, as with batch data, or data that must be handled within the very tight window of time required by real-world processes, as with real-time processing[4].  That leaves a large amount of room for high-level computing that can be done on many platforms and with many varied tools, as well as for the way the system processes data for speed.&lt;/p&gt;

&lt;p&gt;Of the many problems addressed by Apache Spark, one is the importance of speed: IP network traffic is expected to reach 396 exabytes per month by 2022, an increase of 274 exabytes per month over just five years before, in 2017.  One solution that has been successfully applied is in-memory cluster computing, which can mean many different things depending on what system is being applied and why.  To summarize, many groups are attempting to simplify the way the computer processes data from the hard disk.  In traditional systems, data is read from disk-backed storage, which limits the amount of data processed at a given time and can raise the system&amp;#39;s power demand.  Traditionally, many companies overcome this problem through scaling, making their devices smaller and more flexible.  Another method, applied by database vendors, is to process data in main memory, or DRAM, rather than storing it on solid-state drives or disk drives on a server; this accelerates the speed of transactions[3].  Apache Spark is hailed by many as having applied this system of data processing in a way that other systems like it have not, making it 100 times faster in memory and ten times faster on disk than systems such as MapReduce, and giving it a reputation for low latency[5].   Another problem Spark attempts to solve is the abundance of other systems and high-level programming demands that could be placed on it, by making itself accessible to many different processes.&lt;/p&gt;

&lt;p&gt;Spark has six major spaces where it fits best for compute: fast data processing, iterative processing, near real-time processing, graph processing, machine learning, and joining datasets.  I have already discussed how its data processing speed is established and how this translates to near real-time processing, so I will begin with iterative processing.  Spark uses a Resilient Distributed Dataset, or RDD, an immutable dataset that is split up and replicated among multiple nodes, so that if one node fails the others will still process the data[5].  This makes Spark adept at processing and reprocessing the same data very quickly, since it can chain many different operations in memory[2].  Apache Spark also comes with a number of built-in libraries, among them GraphX, which, given Spark&amp;#39;s affinity for iterative processing, uses RDDs for graph computation.  A machine learning library is included with Spark as well, and thanks to Spark&amp;#39;s speed it can train models at a decent pace[5].  All of these things make Spark very comparable in the market to its main alternatives. &lt;/p&gt;

&lt;p&gt;There are many alternatives to Spark, though some are alternatives to only one of the functionalities included in the program, like graph processing or streaming.  For this discussion I would like to focus on the programs meant specifically for big data stream processing: Apache Storm, Apache Flink, IBM InfoSphere Streams, and TIBCO StreamBase.&lt;/p&gt;

&lt;p&gt;Apache Storm is another open source program designed for stream processing and near real-time event processing.  In addition to performing many of the same functions as Spark, such as online machine learning and real-time analytics, it comes with a group of built-in facilities for functions such as cluster management, queued messaging, and multicast messaging.  Storm can be used with any programming language, making it more flexible than Spark in that respect[6].&lt;/p&gt;

&lt;p&gt;Apache Flink is not a micro-batch model; instead, it uses an operator-based model for computation.  Under this model, data elements are pipelined through the included streaming engine as quickly as they are received.  Flink is faster at graph processing and machine learning due to its support for closed-loop iterations, and it is comparable in speed while accepting code from programs like Storm and MapReduce[6].&lt;/p&gt;
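&lt;p&gt;The difference between a micro-batch model and an operator, or pipelined, model can be sketched conceptually.  These functions are illustrative, not any framework's real API: one groups records into small batches before applying an operator, the other forwards each record the moment it arrives.&lt;/p&gt;

```python
# Conceptual sketch contrasting the two streaming models discussed
# here: micro-batching (group records, then process each group) versus
# an operator pipeline that forwards each record as it arrives.

def micro_batch_stream(records, batch_size, operator):
    """Collect records into small batches, then apply the operator."""
    out = []
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            out.extend(operator(x) for x in batch)
            batch = []
    if batch:                      # flush the final partial batch
        out.extend(operator(x) for x in batch)
    return out

def pipelined_stream(records, operator):
    """Forward each record through the operator the moment it arrives."""
    for r in records:
        yield operator(r)

double = lambda x: x * 2
data = [1, 2, 3, 4, 5]

batched = micro_batch_stream(data, 2, double)
piped = list(pipelined_stream(data, double))
assert batched == piped == [2, 4, 6, 8, 10]
```

&lt;p&gt;Both models produce the same results; the trade-off is latency, since a micro-batch system waits for a batch boundary while a pipelined system emits each result immediately.&lt;/p&gt;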

&lt;p&gt;IBM InfoSphere Streams has everything necessary for stream processing, including integration abilities and a highly scalable event server.  The program uncovers patterns, such as data flows, in the information as it arrives and can fuse streams to help draw insights from many different sources.  Streams comes with security software and network management features, a runtime environment for deploying and monitoring stream applications, and a programming model for writing applications in SPL, its Streams Processing Language[6].&lt;/p&gt;

&lt;p&gt;TIBCO StreamBase is mainly for analyzing real-time data and building applications on it, supporting the developers who make those applications so that they are faster and easier to deploy.  The program is notable for its LiveView data mart, which takes continuously streaming data from real-time sources, creates an in-memory warehouse to store it, and then returns push-based query results to users.  Users can act on the returned data through elements meant to make the desktop feel like an interactive command application[6].&lt;/p&gt;

&lt;p&gt;Apache Spark was originally developed at UC Berkeley in 2009.  It is an entirely open source project, currently hosted by the Apache Software Foundation and maintained by Databricks[1].  If you are interested in more information, please investigate the links below.&lt;/p&gt;

&lt;p&gt;The Apache Spark Tutorial&lt;br&gt;
&lt;a href="https://www.tutorialspoint.com/apache_spark/index.htm"&gt;https://www.tutorialspoint.com/apache_spark/index.htm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark Quickstart guide&lt;br&gt;
&lt;a href="https://spark.apache.org/docs/latest/quick-start.html"&gt;https://spark.apache.org/docs/latest/quick-start.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Resources&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Apache Spark™ - What is Spark. (2020, April 13). Retrieved September 27, 2020, from &lt;a href="https://databricks.com/spark/about"&gt;https://databricks.com/spark/about&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Bekker, A. (2017, September 14). Spark vs. Hadoop MapReduce: Which big data framework to choose [Web log post]. Retrieved September 26, 2020, from &lt;a href="https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce#:%7E:text=In%20fact%2C%20the%20key%20difference,up%20to%20100%20times%20faster"&gt;https://www.scnsoft.com/blog/spark-vs-hadoop-mapreduce#:~:text=In%20fact%2C%20the%20key%20difference,up%20to%20100%20times%20faster&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lapedus, M. (2019, February 21). In-Memory Vs. Near-Memory Computing [Web log post]. Retrieved 2020, from &lt;a href="https://semiengineering.com/in-memory-vs-near-memory-computing"&gt;https://semiengineering.com/in-memory-vs-near-memory-computing&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Schiff, L. (2020, May 13). Real Time vs Batch Processing vs Stream Processing [Web log post]. Retrieved 2020, from &lt;a href="https://www.bmc.com/blogs/batch-processing-stream-processing-real-time/"&gt;https://www.bmc.com/blogs/batch-processing-stream-processing-real-time/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Vaidya, N. (2019, May 22). [Web log post]. Retrieved 2020, from &lt;a href="https://www.edureka.co/blog/spark-architecture/"&gt;https://www.edureka.co/blog/spark-architecture/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Verma, A. (2018, May 25). What are the Best Alternatives for Apache Spark? [Web log post]. Retrieved September 26, 2020, from &lt;a href="https://www.whizlabs.com/blog/apache-spark-alternatives"&gt;https://www.whizlabs.com/blog/apache-spark-alternatives&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;What is Cluster Computing: A Concise Guide to Cluster Computing. (2020, May 18). Retrieved September 27, 2020, from &lt;a href="https://www.educba.com/what-is-cluster-computing/"&gt;https://www.educba.com/what-is-cluster-computing/&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Why did I choose Data Science?</title>
      <dc:creator>RCKettel</dc:creator>
      <pubDate>Tue, 08 Sep 2020 04:59:18 +0000</pubDate>
      <link>https://forem.com/rckettel/why-did-i-choose-data-science-53mp</link>
      <guid>https://forem.com/rckettel/why-did-i-choose-data-science-53mp</guid>
      <description>&lt;p&gt;In the last two weeks I have begun taking data science classes at a major bootcamp in Seattle Wa.  There are a few major reasons I had chosen this specific avenue especially after already having a four year degree and job. I will begin by discussing my previous education and career path.&lt;br&gt;
    I originally earned a degree in general archaeology from Western Washington University in 2012.  When I began looking for a job, I searched forums like Shovelbums.com and USAjobs.org, and in a period of about eight months I found only a few short jobs, each lasting a few days to a week.  For this reason I began looking at other career paths.  I primarily found temporary jobs doing warehouse, construction, janitorial, or retail work.  Eventually, in 2017, I found a job moving office furniture for a friend of the family, which I did for a little over a year, and then took the same kind of work at another company for nearly two years.  While doing this work I realized I was never going to be paid enough to have a sustainable career or earn a decent retirement.  For both of those reasons, I began looking for a new career.&lt;br&gt;
    While doing my research I found out about computer science bootcamps and what they had to offer.  Unlike four-year degree programs, bootcamps tend to specialize rather than generalize.  A four-year degree gives the student a more rounded education in mathematics, algorithms, OS design, and programming languages, essentially giving the student knowledge that makes them more flexible and gives them a better understanding of code.  This often gives the prospective employee the ability to learn new programming tools faster.  While this is a great advantage and will often lead to good career prospects, many CS degree programs have low acceptance rates and high costs(1).&lt;br&gt;&lt;br&gt;
    A bootcamp, however, gives a student the major groundwork in the programming languages and tools specific to their career path, such as Python, GitHub, and APIs, and does so in a short amount of time(2).  These courses often include building a portfolio and job counseling to help the student find a new profession.  Indeed, many bootcamps claim very high placement rates.  The drawback is that a bootcamp graduate is less likely than a person with a four-year degree to land a senior-level role, unless hired from within the company, especially if that job is highly specialized.  In both cases students will often land entry-level jobs comparable in prestige and pay(1).&lt;br&gt;
  The reasons I chose a bootcamp over a four-year degree are cost and time.  I started college late and, regretfully, never went back for my master's degree.  I have changed careers at the age of 35, meaning I would be nearly forty years old by the time I finished a degree, provided I could get accepted into a program, and the cost to me would be exorbitant: an additional $40,000-50,000 in school loans, more if I attended an out-of-state college.  Meanwhile, a bootcamp is often about a third the cost, averaging about $14,000, and takes less time, an average of about three months(1).  Upon researching possible career paths I happened upon data science.&lt;br&gt;
    This particular branch of programming has grown to become one of the fastest growing and most widely used toolsets among many different companies and fields.  Specifically, data scientists collect, process, and analyze data to answer questions for companies that need reliable information to make proper decisions about their future business strategies.  Because these methods are so necessary to companies that need more and more accessible data, employers ranging from Convoy, which uses data science in the North American trucking industry, to Flatiron Health, which uses data science in cancer research efforts(3), are looking to hire data scientists.  According to Glassdoor.com, data scientist is the number three best job in the nation in 2020, with a median base salary over $100k and a fairly decent job satisfaction score(4).  Data science therefore seemed like a good job used across a broad spectrum of fields I might want to contribute to.  At that point I simply needed to decide on a specific school.&lt;br&gt;&lt;br&gt;
    I looked at a number of schools, but most seemed overly short, and I was worried I would not be adequately prepared for a job in my new career if I didn't spend enough time on the subject.  In addition, the cost of the classes was a big sticking point for me.  Some schools had a payment strategy called pay-share (an income-share agreement) that allowed a student to forgo paying for their classes until they had been placed in a position, then give up part of each paycheck to pay off the cost of school without taking out a loan.  A student could also use traditional methods, such as paying out of pocket or taking school loans through an institution partnered with the school.  Sometimes these partners have short windows to pay back a loan: the lender at the school I am currently attending offered a reasonable APR for my credit score, but the payback period, at about 36 months, was much shorter than if I had gone to a bank.  I decided instead to get a personal loan through an independent institution.  The thing that made me choose the school I went with, Flatiron School, was its money-back guarantee: if the student is not offered a job within three months of graduation, their money is returned.  I appreciated the integrity of a school that stood behind its curriculum in such a way.  I now have thirteen more weeks of school and hope to begin my new career soon.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Nguyen, M. (2017, September 13). Coding Bootcamps vs CS Degrees – 5 Main Differences [Web log post]. Retrieved September 07, 2020, from &lt;a href="https://www.codingdojo.com/blog/coding-bootcamps-vs-cs-degrees#:%7E:text=Time"&gt;https://www.codingdojo.com/blog/coding-bootcamps-vs-cs-degrees#:~:text=Time&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Williams, A. (2020, April 16). Coding Bootcamp Vs. College [Web log post]. Retrieved September 07, 2020, from &lt;a href="https://www.coursereport.com/blog/coding-bootcamp-vs-college"&gt;https://www.coursereport.com/blog/coding-bootcamp-vs-college&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bowne-Anderson, H. (2018, August 15). What Data Scientists Really Do, According to 35 Data Scientists. Retrieved September 07, 2020, from &lt;a href="https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists"&gt;https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best Jobs in America. (n.d.). Retrieved September 08, 2020, from &lt;a href="https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm"&gt;https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
  </channel>
</rss>
