New_The Data Structure Explained
The objective is to initiate the development process of a data warehouse (DW) and take its initial steps. On top of the DW, we intend to develop the PBI reports. Traditionally, a data warehouse involves transferring data from a transactional database to a dedicated repository at specific intervals. However, our current stage does not involve this transfer. Instead, we establish an intermediary layer with a warehouse-like design consisting of DIM (dimension) and FACT tables. These tables are views created from our relational database tables, aiming to replicate the structure and functionality of a data warehouse.
We currently don't possess expert-level knowledge in data warehousing but have studied its principles. Consequently, the design might need to be improved.
Getting this right is very important to us, and therefore, I would appreciate your insights regarding the design and any recommendations you have. Perhaps we should discuss this person-to-person and ensure we have a proper data model to move forward.
Please let me try and explain the design and the details.
FACT_LANDINGS
FACT_LANDINGS contains duplicated rows. Should we consider all of them - which will cause cartesian on the final metrics results - or should we consider removing duplicates? Should we consider the same rule for the other fact/dimension tables if that is the case?
Within our relational database, we have three interconnected tables: SURVEY, SURVEY_ITEM, and SURVEY_ITEM_DETAILS, each associated through one-to-many relationships.
Our current endeavor involves crafting a data model that effectively tackles the following scenarios.
SURVEY = Survey Session, master
SURVEY_ITEM = Interview, detail1
SYRVEY_ITEM_DETAILS = Catches, detail2
By designing FACT_LANDINGS we flatten this data structure. So for each combination of Species, Gear, Fishing_Zone and Condition_id a new line is created in the database (at the third level) and therefore there are repetitions in the two levels above.
Lucas it is possible to work with this in a flattened table (FACT_LANDINGS) or would you recommend to for us to create an API for SURVEY, SURVEY_ITEM, SYRVEY_ITEM_DETAILS?
SELECT
s.survey_id,
s.survey_date AS ldate,
s.landing_site_id,
si.vessel_id,
sid.species_id,
sid.weight,
sid.price,
si.fuel_used,
sid.weight * sid.price AS total,
sid.gear_id,
sid.fishing_zone_id,
sid.condition_id,
s.tenant_id,
s.is_active,
s.created_by,
s.created_time
FROM
survey s,
survey_item si,
survey_item_dtl sid
WHERE
s.survey_id = si.survey_id (+)
AND si.survey_item_id = sid.survey_item_id(+)
DIM_DATES
What is the purpose of the DIM_DATES table? We intend to create a new dynamic calendar table to encompass all dates period - from Jan 1th to Dec 31th of all your data.
DIM_DATES contains duplicate values for the indicated key, which causes a N:N relationship in Power BI, which is not recommended
The purpose of DIM_DATES is simply to extract the dates found in table SURVEY. Perhaps we should redesign and have all possible dates?
Please advise!
SELECT
survey.survey_date,
survey.survey_id,
EXTRACT(DAY FROM survey.survey_date) "DAY",
EXTRACT(MONTH FROM survey.survey_date) "MONTH",
EXTRACT(YEAR FROM survey.survey_date) "YEAR",
to_char(survey.survey_date, 'IW' ) "WEEK",
to_char(survey.survey_date, 'Q' ) "QUARTER",
survey.tenant_id,
survey.is_active
FROM
survey
WHERE
is_active = 0
DIM_PARTIES
DIM_PARTIES only matches the join criteria for PartyID on the fact tables, not with OPERATORID and OWNERID. Should DIM_PARTIES consider OPERATORID and OWNERID for their relationship to the fact tables? If yes, what are the corresponding keys on the fact tables?
PARTY_ID should match the join for OPERATORID and OWNERID in FACT_VESSELS—the following works.
SELECT
datistica.fact_vessels.vessel_id,
datistica.fact_vessels.vessel_name,
datistica.fact_vessels.registration_no,
datistica.fact_vessels.district_no,
datistica.dim_parties.party_id,
datistica.fact_vessels.operator_id,
datistica.dim_parties.first_name as "OPERATOR_FIRST_NAME",
datistica.fact_vessels.owner_id,
dim_parties1.first_name AS "OWNER_FIRST_NAME"
FROM
datistica.fact_vessels
LEFT JOIN datistica.dim_parties ON datistica.fact_vessels.owner_id = datistica.dim_parties.party_id
LEFT JOIN datistica.dim_parties dim_parties1 ON datistica.fact_vessels.owner_id = dim_parties1.party_id
VESSEL_ID | VESSEL_NAME | REGISTRATION_NO | PARTY_ID | OPERATOR_ID | OPERATOR_FIRST_NAME | OWNER_ID | OWNER_FIRST_NAME |
2056 | Easy Does It / Niesha | V4-469-BE | 14359 | 14359 | Shenequa | 14359 | Shenequa |
2039 | God's Grace | V4-615-SP | 14361 | 14307 | Silas | 14361 | Silas |
2022 | Busy | V4-470-OL | 14362 | 14362 | Spencer | 14362 | Spencer |
2022 | Busy | V4-470-OL | 14362 | 14362 | Spencer | 14362 | Spencer |
1986 | TERENZ | V4-308-LH | 14363 | 14363 | St. | 14363 | St. |
1872 | First Time | V4-039-BE | 14364 | 14364 | Stuyesant | 14364 | Stuyesant |
1884 | Just In Time | V4-451-BE | 14365 | 14365 | Sydney | 14365 | Sydney |
1877 | Sea Dreams | V4-497-OL | 14366 | 14366 | Sylpatra | 14366 | Sylpatra |
DIM_COORDINATES
DIM_COORDINATES do not relate to any other table. Is it just for map plotting purposes?
I don't think we are using DIM_COORDINATES as of now. The only coordinates we are using are in the DIM_LANDING_SITES.