An enormous amount of data exists in many different formats, including databases (MS SQL, MySQL, etc.), data repositories (.txt, HTML, PDF, etc.), and NoSQL stores such as MongoDB. The varied locations in which the data is stored complicate its processing, storage, and management. When combined, data from several sites can yield a great deal of important information, and many researchers have accordingly suggested different methods to extract, examine, and integrate it. To manage heterogeneous data, researchers have proposed data warehouses and big data as solutions; however, each of these methods has limitations when it comes to handling a variety of data. It is necessary to comprehend and use this information, and to evaluate the massive quantities of it that grow day by day. We propose a solution that facilitates data extraction from a variety of sources. It involves two steps: first, extract the pertinent data; second, identify a machine learning algorithm to analyze it. This paper proposes a system for retrieving data from many sources, such as databases, data repositories, and NoSQL stores. The framework was then tested on a variety of datasets to extract and integrate data from diverse sources, and the integrated dataset was found to perform better than the individual datasets in terms of accuracy, management, storage, and other factors. Thus, our prototype scales and functions effectively as the number of heterogeneous data sources increases.
Data is growing at an exponential rate, with 80-90% of the world's data generated in the last two years alone. There are currently 555 million websites, up 300 million from the previous year, making for a diverse range of data sources, each with its own framework and structure. The problem of meeting users' desired quantitative methodology over this data must be recognized and resolved as soon as possible. Big data extraction now faces additional difficulties as heterogeneous data sets proliferate quickly. Sources are numerous, including social media, news, government departments, the health sector, the agriculture sector, and many other domains, and their number increases daily; scholars and researchers are therefore aware of the value of the data and of the need to extract patterns that turn raw data into information. These sources can contain different types of data, including data stored at the database level and data in various file formats such as PDF, TXT, and HTML. To deliver real insights, we therefore need a unified system that can execute queries against one or more data sources.
The handling of data from numerous sources presents challenges for many departments and organizations. These data sources include duplicate data, time series data, and internet transactions. Hospitals are one of the best examples of heterogeneous data environments: information about patients is kept in one database, while information about X-rays, medications, and other records is kept in other databases, because both tabular and image data must be managed for every single patient. To meet accreditation requirements, this data needs to be properly maintained, organised, accessed, and assessed in accordance with a defined format. Other organizations, such as meteorological departments, also struggle with handling data, because different rainfall data characteristics are stored on a regular basis. This poses a major threat to the efficiency of database systems. The data that meteorological agencies retain includes satellite data, GIS data, and data gathered from RADAR systems. These forms of data are extremely difficult to manage and require a great deal of processing power for storage, retrieval, and management. Furthermore, much of this data is stored in an unstructured manner, in a number of languages and formats. Traditional methods of data management are inadequate because the data is so large and so complex , .
Consequently, presenting this diverse unstructured material in a systematic fashion will always be difficult. Data warehousing, big data, and other approaches have been proposed to address the heterogeneity challenges at various levels of granularity, but each has its own set of drawbacks when it comes to effectively managing this varied mix of data .
One common method for addressing data heterogeneity is to transform the sources and combine them into a single data source , , , . However, lexical, syntactic, and geometrical inconsistencies that arise during integration may cause some information to be lost when all of the data sources are merged into one, and the required storage capacity grows to that of a data warehouse.
Rehman et al.  offer a method for achieving better data collection outcomes; their model is said to reduce data in the early stages. The suggested approach, however, did not take into account the use of ML techniques or data processing; it concentrated solely on data minimization.
Machine learning is now required to manage the data, owing to the increase in data from sources such as literature, databases, and repositories. The authors' research  aims to review work on knowledge discovery across various data sources using machine learning techniques; it also summarizes the fundamental ideas, typical tools, and implementations of ML in this area. Data generated from heterogeneous sources has thus been analyzed using machine learning techniques.
In the study , the authors covered the issues that can occur when integrating diverse data at a metallurgical facility. They introduced an information model for defining the specifications of information on metallurgical output, examined integration approaches, and assessed the potential for applying them to integrate metallurgical businesses with diverse data.
Hashim et al.  provide an integrated, mediated, data-warehouse-based architecture for cognitive integration. They employ two kinds of ontologies, local and global, to direct the integration and manage syntactic, structural, and semantic heterogeneity.
The study's  suggested investigation of ontology-driven semantic data integration in an open environment treats ideas, relations, and characteristics as primordial and, as such, irreducible entities. In light of the intentional conceptualization model, a formal intentional account of both ontology and ontological commitment is proposed, along with an intentional model for ontology-driven mediated data integration in an open context. The suggested model takes the open environment's dynamic nature into account and intentionally describes the data from the data sources. The link between global and local ontologies is then defined, together with the formal intentional semantics of query answering.
All of the above studies have limitations, which leads us to offer a possible solution to the heterogeneity problem. Data processing, data management, and, most importantly, data storage are among the current problems. Many researchers are motivated by data processing to create complex heterogeneous systems; nevertheless, evaluating vast volumes of data demands greater computing capacity. Furthermore, data management and storage are perpetual issues, with no comprehensive answer in sight. Addressing these issues is therefore the primary objective of this research. All of the systems under discussion support only a limited number of data sources: wrappers must be generated manually or hard-coded, and only a few data sources , , , , , ,  appear to be supported.
Many academics have proposed a variety of approaches to handle heterogeneous data at different granularity levels; however, all of these systems have limitations. Data warehouses and big data are two of them, and each of these existing options has its own set of drawbacks, discussed below:
Data warehouses have been developed in a variety of industries , , , . Modern DWs, however, face brand-new scientific challenges: today's data sources are numerous, independent, adaptable, and distributed. Due to these difficulties, traditional data warehouses are constrained in some ways, including in terms of data essence, availability, storage methods, and so forth. According to reference , this is because of a lack of scalability caused by processing difficulties combined with inherent data problems and restrictions on the underlying hardware, application software, and other infrastructure. Researchers employing data warehouses to address data heterogeneity currently encounter a number of problems, including the following:
1. Constructing a data warehouse for a major organization is a difficult operation that can take years to complete .
2. Data warehouse administration is also challenging and requires a team with advanced technical knowledge .
3. Monitoring data warehouse quality is hard when data heterogeneity is taken into account; both the quality and the consistency of the data fall short .
To analyze massive amounts of data, big data systems need high-speed computing infrastructure, which can be costly in terms of data collection, storage, processing, and visualization , , , , , , , , .
Big data systems face a number of challenges, one of which is the need to protect corporate and individual privacy through verification and security. Another is data management, which may be expensive and time-consuming when done at a heterogeneous level. Managing the storage of data that is impressively large in size is also tough, since storage is always a difficult consideration, and processing such a vast volume of data remains a major difficulty. Last but not least is the management of data from several heterogeneous sources: when data comes from one or more sources with diverse structures and from different platforms, a number of challenges follow.
Each of the two workable options discussed so far thus has its own set of limitations. We therefore need a better solution, one that will also take care of the problems that may arise in the future.
This section presents the framework that enables researchers to examine data from several heterogeneous sources, such as various databases, data repositories, and other sources. With n databases holding n*m tables, and those numbers growing exponentially, the difficulty of analyzing data from these sources will only rise.
Our framework enables data from a variety of sources to be used for analysis by extracting the data from each source separately. In our method, the user queries the system and, based on the query, the analytics engine invokes analytic models that fetch the data to the staging server regardless of its structure. While running, the model sends data requests to the data source interface. Different data interface (DI) procedures have been created in the data source interface (staging server) to retrieve data in the format required by the analytical model. The elements that engage with the various data sources are called wrappers . DI routines are in charge of query creation, query execution, formatting, and returning the result data to the analytical system. To evaluate the heterogeneous data and test scalability, the framework should be independent of the type, size, and structure of the data. We assume that the data from the indicated sources is not centralized prior to conducting the experiment. Identifying a source also implies that metadata, which contains details about the data, is present. All sources, including different databases (MySQL, etc.), data repositories (txt, html, pdf, etc.), and NoSQL (MongoDB) , are shown below (Figure 1).
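The wrapper pattern just described can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class and function names (SourceWrapper, MySQLWrapper, CSVWrapper, staging_fetch) are assumptions made for the example, and the "database" here is a plain in-memory row list standing in for a real connection.

```python
class SourceWrapper:
    """Interface each per-source wrapper implements for the DI routines."""
    def fetch(self, predicate):
        raise NotImplementedError

class MySQLWrapper(SourceWrapper):
    """Stand-in for a wrapper over a relational database."""
    def __init__(self, rows):
        self.rows = rows  # a real wrapper would hold a DB connection instead
    def fetch(self, predicate):
        # a real wrapper would translate the predicate into SQL and run it
        return [r for r in self.rows if predicate(r)]

class CSVWrapper(SourceWrapper):
    """Stand-in for a wrapper over a text/CSV repository file."""
    def __init__(self, text):
        header, *lines = text.strip().splitlines()
        keys = header.split(",")
        self.rows = [dict(zip(keys, ln.split(","))) for ln in lines]
    def fetch(self, predicate):
        return [r for r in self.rows if predicate(r)]

def staging_fetch(wrappers, predicate):
    """Analytics-engine entry point: pull matching rows from every source
    into the staging area, regardless of each source's structure."""
    staged = []
    for w in wrappers:
        staged.extend(w.fetch(predicate))
    return staged
```

The point of the sketch is that the analytics engine only ever calls `fetch`; each wrapper hides its source's structure behind that one interface, which is what makes the framework independent of the type and structure of the data.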
A staging server , , or temporary server, is also available. No data source is stored on the staging server: after being taken from the various sources (databases, data repositories, and NoSQL), the data is inserted into the staging server in line with its structure, and the staging server is flushed, without persisting the data, every time the process finishes. A potential issue is that this may result in data redundancy, or that the data we get is not normalized, but the focus here is data retrieval alone; following retrieval, any normalization strategy can be applied.
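The flush-after-use lifecycle can be summarized in a short sketch. The StagingServer class name and its methods are hypothetical illustrations of the behaviour described above, assuming rows are buffered in memory for the duration of one request.

```python
class StagingServer:
    """Temporary buffer: rows land here from any source, are processed
    once, and the buffer is flushed when the request finishes."""
    def __init__(self):
        self._buffer = []

    def load(self, rows):
        # raw rows from any source; possibly redundant or unnormalized
        self._buffer.extend(rows)

    def process(self, analyze):
        result = analyze(list(self._buffer))
        self._buffer.clear()  # flush: nothing persists between requests
        return result

    def pending(self):
        return len(self._buffer)
```

Because `process` clears the buffer, no extra long-term storage is consumed no matter how many requests pass through, which is the storage advantage the staging server is meant to provide.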
The experiment to analyze the data from the heterogeneous sources follows a two fold process:
1. Extract the relevant information.
2. Identification of machine learning algorithm to analyze the data.
For the extractions of relevant information we have:
(1). Identification of sources:
The server administrator is responsible for this, so some manual work may be involved. The administrator must add some basic information for every repository that is present.
(2). Data extraction:
Data extraction items such as keywords, sets of numbers, and characters specify the type of data for which information is required from the various sources.
Accessing the metadata is the initial stage of the data extraction process. The data may be present at the source in one of three forms: databases, data repositories, and NoSQL. Each of the n kinds of database must be visited, and there are likewise several repositories and NoSQL stores. Because every database has a unique structure, databases exhibit a great deal of heterogeneity; NoSQL stores, by contrast, share the same structure, so there is very little heterogeneity among them (Figure 2).
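Consulting the metadata to decide how each source must be visited can be sketched as a simple dispatch. The METADATA entries and extractor names below are assumptions invented for the example; the real knowledge base described in the paper would hold richer per-source details.

```python
# Hypothetical metadata registry: one entry per identified source.
METADATA = {
    "patients_db":   {"kind": "database",   "engine": "mysql"},
    "rainfall_docs": {"kind": "repository", "formats": ["txt", "pdf", "html"]},
    "sensor_store":  {"kind": "nosql",      "engine": "mongodb"},
}

def extraction_route(source_name):
    """Map a registered source to the extraction path it requires."""
    kind = METADATA[source_name]["kind"]
    if kind == "database":     # per-database schemas: high heterogeneity
        return "sql_extractor"
    if kind == "repository":   # parse files per format
        return "file_extractor"
    return "document_extractor"  # NoSQL: largely uniform structure
```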
The staging server acts as a data source interface to which requests are sent from the different analytic models of the analytics engine. Each data interface procedure is aware of the schema of the data source it needs to connect to, and formulates the query in accordance with that source's format in order to retrieve the required data. The extracted information thus comes through the wrappers created to access a particular sort of source. With this, the identification and extraction process is complete.
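Formulating "the query in accordance with the data source's format" can be illustrated by rendering one abstract request (match a field to a value) in each source's native form. These helper names are illustrative; a production DI procedure would use parameterized queries and real driver calls.

```python
def to_sql(table, field, value):
    """Relational rendering of the request (parameterize in real use)."""
    return f"SELECT * FROM {table} WHERE {field} = '{value}'"

def to_mongo_filter(field, value):
    """MongoDB-style filter document for the same request."""
    return {field: value}

def to_file_predicate(field, value):
    """Row-level predicate for parsed repository files (txt/html/pdf)."""
    return lambda record: record.get(field) == value
```

One logical request, three physical forms: this is exactly the translation each wrapper performs for its source type.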
The framework was implemented on the structure shown below (Figure 3). In the first phase, the information was distributed across multiple sources. Subsequently, a decision tree algorithm (Iterative Dichotomiser 3) was implemented on it. A decision tree is constructed by recursively dividing the data into smaller partitions based on splitting rules or criteria. An attribute selection measure, or splitting rule, is a heuristic for selecting the criterion that best separates a class-labelled training dataset into distinct classes; it should be chosen so that splitting produces pure partitions, meaning that all entries in a given partition belong to the same class. The ID3 algorithm is a decision tree algorithm that works on information gain and entropy. In the next phase the framework was implemented, and the data was transformed and loaded.
Once the data is accumulated and stored in the staging server, an appropriate algorithm is applied to it based on its structure, volume, and so on. After the data from the different sources has been extracted, a machine learning algorithm is selected to examine it. Since the primary purpose of this work is to examine a variety of data sources without identifying the data, the Iterative Dichotomiser 3 (ID3) algorithm was applied to the data extracted from the different sources, and the results were generated accordingly.
The experiment was carried out on a geographical dataset collected from different data sources, as shown in Figure 3. In this paper we implemented a basic classification algorithm to check the accuracy level at each source; later, the same data was integrated into one file and the same algorithm was applied to check the final accuracy after integration.
The data collected from the heterogeneous sources consists of different meteorological parameters, including humidity at 12 am and 3 pm, minimum and maximum temperature, and season, with rainfall as the target parameter. The integrated dataset contains around 6,000 records. To check the accuracies at the individual and final levels, we divided the dataset in a 70% training and 30% testing ratio. The experimental results obtained in processing the data are shown below in Table 1. The source 1, source 2, and source 3 datasets were found to have few parameters, resulting in lower accuracy than the integrated data. The integrated data's higher accuracy arises because it contains all of the parameters present in all of the source data, and the more parameters a dataset has, the greater the chances of high and correct accuracy measures; the results are as per the ID3 algorithm.
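The 70/30 evaluation protocol can be sketched with stdlib Python alone. This is an illustration of the split-and-score procedure only, on synthetic records with a trivial majority-class predictor; it does not reproduce the paper's ID3 results or Table 1.

```python
import random
from collections import Counter

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle a copy of the rows and cut at the 70/30 boundary."""
    rows = rows[:]  # do not disturb the caller's order
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def accuracy(predict, test_rows, target):
    """Fraction of held-out rows the classifier labels correctly."""
    hits = sum(predict(r) == r[target] for r in test_rows)
    return hits / len(test_rows)

def majority_class(train_rows, target):
    """Toy stand-in classifier: always predict the most common label."""
    label = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    return lambda _row: label
```

In the experiment itself, the learned ID3 tree takes the place of `majority_class`, and the same `accuracy` computation is run once per individual source and once on the integrated file.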
The graphical representation of the experimental results is shown below (Figure 4). The accuracy measures at the individual data sources are comparatively low, while the integrated dataset has an accuracy of 91%, which shows that the proposed solution works well compared to data stored at different granularity levels.
Heterogeneity combined with volume makes integration difficult, and when multiple sources are involved the problem multiplies , , . Machine learning has progressed over the years and attained standardization in terms of algorithm types; however, scalability and granularity remain ad hoc rather than standardized.
In this paper, a framework is proposed which retrieves data from numerous sources, including databases, data repositories, and NoSQL. With data stored on varying infrastructure, it is not humanly possible to implement a machine learning algorithm on multiple sources in isolation; doing so would generate contradictory results and may lack correlation between the varying parameters. Accordingly, a framework is proposed which works on varying sources and structures and integrates the data into a staging server, whereupon machine learning algorithms are applied.
The staging server is temporary data storage (a buffer) used to hold the integrated data, whereas the metadata (knowledge base) is permanent storage containing information about the whole framework, including the sources as well as the deletion and modification of repositories.
Furthermore, a typical technique (ID3) was applied first to the data obtained from the various sources and then to the integrated source, leading to the accuracy measurements. The accuracy was computed using the decision tree's IF-THEN-ELSE rules. Since the accuracy may be improved by applying decision tree extensions such as random forests, extra trees, and several ensemble techniques, this result will serve as a benchmark for those other methods.
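Reading IF-THEN rules off a learned tree can be sketched as a root-to-leaf traversal. Both the nested-dict tree representation ({attribute: {value: subtree_or_class_label}}) and the function name below are illustrative assumptions, not the paper's code.

```python
def tree_rules(tree, conditions=()):
    """Return (condition_list, class_label) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf: the path's class label
        return [(list(conditions), tree)]
    (attribute, branches), = tree.items()   # internal node tests one attribute
    rules = []
    for value, subtree in branches.items():
        rules.extend(tree_rules(subtree, conditions + ((attribute, value),)))
    return rules
```

Each returned pair reads as "IF every condition holds THEN predict the label", and scoring these rules against held-out records yields the accuracy figures reported above.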
Numerous methods have been offered by researchers over the years to examine data from diverse sources, and the currently suggested approaches have many problems and restrictions, such as management and data storage. The difficulties that now exist, and the solutions this paper offers in response, are tabulated below (Table 2).
Proposed solution results
It is clear that, since we are not transferring the data to new infrastructure, there are no additional administration charges or handling requirements.
We used the idea of a staging server to address this problem: the staging server is flushed each time a request has been handled, which avoids the use of additional storage.
Non-managed files are not covered by the staging server for the purposes of delta data capture and data propagation; they must be manually copied from the staging server to the production server. This does not affect the overall processing of the system servers.
A framework is proposed which works on varying sources and structures and integrates the data into a staging server, whereupon machine learning algorithms are applied. This helps us to tackle the heterogeneous sources.
In this study we described a framework for extracting and analyzing data from heterogeneous sources. The framework performs distributed cross-source join operations and allows users to specify, instantly at query time, the changes that enable joinability. It uses a two-step procedure: it first extracts data from the various sources and then applies any conventional machine learning algorithm to analyze the information. By relying on the connectors of the latter, this architecture spares users from manually creating wrappers, a major bottleneck for supporting data variety throughout the literature. When the framework was tested on a variety of datasets to extract and integrate data from various sources, the integrated dataset was found to outperform the component datasets in terms of accuracy, management, storage, and other characteristics. In the future, the performance of the suggested framework will be examined on various datasets with higher levels of variability, which will enable us to test the framework's scalability in diverse settings.
Iqbal Hassan, S.A.M. Rizvi & Majid Zaman: Framed the main idea of the work, implementation, interprets the results, Data curation, Data collection and Visualization. Waseem Jeelani Bakshi: Study plans with all authors. Provides the basic idea of the work, Design and draft of the manuscript. Sheikh Amir Fayaz: Study conception and Investigation, testing of the results, editing of the manuscript and paper writing.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.