Data Extraction Techniques
Data is everywhere, and much of it is highly sensitive and personal, so it must be carefully guarded for privacy and other reasons.
There are many techniques for extracting data, depending on the kind of data source and the intended use of the data. Examples include:
- Optical character recognition (OCR), which is used to interpret and digitize text scanned from paper documents so it can be stored as a computer-readable file.
- Analog-to-digital converters (ADCs), which can digitize analog audio recordings and signals, and charge-coupled devices (CCDs), which capture and digitize images.
- Opinions, questionnaires, and vital statistical data obtained through polling and census methods.
- Cookies, user logs, and other methods used for tracking human or system behavior.
- Web scraping, used to crawl web pages in search of text, images, tables, and hyperlinks.
- APIs, which are readily available for extracting data from all sorts of online data repositories and feeds, such as government bureaus of statistics, libraries, weather networks, online shopping, and social networks.
- SQL, for querying relational databases.
- NoSQL query languages, for querying document, key-value, graph, or other non-relational data repositories.
- Edge computing devices, such as video cameras that have built-in processing that can extract features from raw data.
- Biomedical devices, such as microfluidic arrays that can extract DNA sequences.
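As a rough illustration of one technique from the list above, web scraping, the sketch below pulls hyperlinks and visible text out of an HTML page using only Python's standard library. The page content in `SAMPLE_HTML`, the `LinkExtractor` class, and the `scrape` helper are all hypothetical names made up for this example; a real scraper would first download the page and would need to respect the site's terms of use.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag and the visible text of the page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep only non-empty text between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


# Hypothetical page content, embedded so the example needs no network access.
SAMPLE_HTML = """
<html><body>
<h1>Weather stations</h1>
<p>See the <a href="/stations/north">north</a> and
<a href="/stations/south">south</a> stations.</p>
</body></html>
"""


def scrape(html):
    """Return (links, text) extracted from an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


links, text = scrape(SAMPLE_HTML)
print(links)
print(text)
```

The same pattern extends to tables or images by handling other tags in `handle_starttag`.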
A few high-level examples of use cases, along with their raw data sources and extraction techniques, follow:
- You can use APIs to extract data from multiple structured data sources for integration into a central repository. You can also use APIs to capture periodic or asynchronous events and store them in a history archive.
- Rather than transmitting potentially very large volumes of redundant data from IoT devices, you can use edge computing to reduce that data volume by extracting features of interest from the raw data. Often, though, this kind of extraction at the source is impractical, so the data is migrated to storage as-is for further processing, analysis, or modeling.
- You can use medical imaging devices and biometric sensors to acquire data for diagnostic purposes.
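The edge computing use case above can be sketched in a few lines: instead of sending every raw reading, a device summarizes each window of readings into a small feature record and transmits only that. This is a minimal sketch under assumed conditions; the function names, the window size of 4, the threshold of 30.0, and the sample temperature-like readings are all hypothetical and not taken from any particular device or library.

```python
from statistics import mean


def extract_features(readings, threshold):
    """Reduce a window of raw readings to a small feature record."""
    if not readings:
        raise ValueError("readings must not be empty")
    return {
        "count": len(readings),
        "mean": mean(readings),
        "min": min(readings),
        "max": max(readings),
        # Number of readings above the (assumed) alert threshold.
        "over_threshold": sum(1 for r in readings if r > threshold),
    }


def windows(stream, size):
    """Yield consecutive, non-overlapping windows of up to `size` readings."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield window
            window = []
    if window:
        # Emit the final partial window, if any.
        yield window


# Hypothetical raw sensor readings.
raw = [20.0, 21.0, 22.0, 35.0, 20.0, 19.0, 18.0, 19.0]
summaries = [extract_features(w, threshold=30.0) for w in windows(raw, size=4)]
for summary in summaries:
    print(summary)
```

Here eight raw readings shrink to two summary records. Which features to keep is a design choice that depends on what the downstream analysis needs.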