According to a new report published today by Glassdoor, data scientists hold the best jobs in America. For those curious, the term “data scientist” typically refers to someone with a mix of skills: part statistician, part computer programmer. For instance, data scientists often have to write code (in Python, say, one of our favorite languages!) to scrape the web for data that may not come in a neatly packaged format, whereas a traditional “statistician” is conventionally focused on sophisticated data analysis techniques.
At Greenlight VR, we do a lot of data processing and use many data science methods, both modern and tried-and-true, in the normal course of business. In the best case, the data we analyze is available in nicely formatted datasets. Most of the time, however, the data is available only piecemeal, spread across large datasets or collections of poorly formatted documents scattered across the web. To solve this problem, we build data processing pipelines.
Take, for instance, our data pipeline for Virtual Reality Wire. The data we aggregate for this analysis is available online on a publicly accessible site with links to each document in the data collection. Sometimes this data is static, and other times new documents are added frequently. To handle the frequent additions, we built a system that continuously grabs new documents as they become available, in addition to performing the initial data ingestion. In this pipeline, an asynchronous job queue orchestrates every step, including document download, ingestion, and cleansing. There are two separate tasks: one for ingesting the initial data and one for ingesting newly added documents. We use a separate library to download the documents, along with special techniques to parallelize the requests. Next, we parse the documents to retrieve the facts we need. At this point, we either store the raw data in its final destination or perform further data cleansing. Together, these pieces form an efficient and scalable data processing pipeline for analyzing industry developments at publicly traded companies.
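For readers who like to see the shape of such a pipeline in code, here is a minimal sketch in Python. It is not our production system: the in-memory `FAKE_SITE` dictionary, the `download`, `parse`, `cleanse`, and `ingest` functions, and the "key: value" document format are all assumptions made up for illustration. It uses the standard library's `asyncio.Queue` as the job queue and a handful of concurrent workers to stand in for parallelized requests.

```python
import asyncio

# Hypothetical stand-in for the public site: URL -> raw document text.
FAKE_SITE = {
    "doc/1": "  Company: Acme VR \n Revenue: 10 ",
    "doc/2": "Company: Globex\nRevenue: 25",
}

async def download(url):
    """Pretend to fetch a document; a real pipeline would call an HTTP client."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return FAKE_SITE[url]

def parse(raw):
    """Pull 'key: value' facts out of a raw document (assumed format)."""
    facts = {}
    for line in raw.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            facts[key.strip().lower()] = value.strip()
    return facts

def cleanse(facts):
    """Normalize types; in this sketch only revenue becomes an integer."""
    cleaned = dict(facts)
    if "revenue" in cleaned:
        cleaned["revenue"] = int(cleaned["revenue"])
    return cleaned

async def worker(queue, store, seen):
    """Take URLs off the queue and run download -> parse -> cleanse -> store."""
    while True:
        url = await queue.get()
        try:
            if url not in seen:  # skip documents ingested earlier
                seen.add(url)
                store[url] = cleanse(parse(await download(url)))
        finally:
            queue.task_done()

async def ingest(urls, store, seen, n_workers=4):
    """Enqueue URLs and process them with several concurrent workers."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(queue, store, seen))
               for _ in range(n_workers)]
    await queue.join()  # wait until every queued URL has been handled
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

async def run_pipeline():
    store, seen = {}, set()
    # Task 1: initial ingestion of everything currently listed.
    await ingest(list(FAKE_SITE), store, seen)
    # Task 2: a new document appears; only unseen URLs are processed.
    FAKE_SITE["doc/3"] = "Company: Initech\nRevenue: 7"
    await ingest(list(FAKE_SITE), store, seen)
    return store

if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

The `seen` set is what lets the second pass pick up only the newly added document instead of re-downloading the whole collection. In a real deployment the queue would more likely be a persistent job system and the store a database, but the division of labor between the steps would look much the same.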
But for all the strength of our tools and methods, it is our analysts who put it all to work, analyzing the future on behalf of our clients. Our singular focus allows us to commit vast resources toward broader and deeper industry analyst coverage and to deliver timely insight for our clients.